The ChipList, by Adrian Offerman; The Processor Portal

21  Intel Itanium processors

Overview




True 64 bit processor (contrary to AMD64 and EM64T, being 64 bit extensions to the IA-32 architecture).
EPIC architecture (Explicitly Parallel Instruction Computing): tight coupling between hardware (processor) and software (compiler).
Performance depends on compilers generating efficient code (Instruction Level Parallelism, ILP), requiring more intelligence in the compiler back-end.
Itanium Instruction Set Architecture (ISA), Itanium System Environment (ISE).

History

Initiated by HP in 1990 (code name PA-WideWord, PA-WW), jointly developed by Intel and HP from 1993, originally as a follow-up to their respective X86 and PA-RISC processors. Other processors that would be obsoleted by Itanium were DEC/Compaq Alpha and SGI MIPS.

Itanium has had problems gaining market share ever since its introduction in 2001. The first parts arrived two years late. Performance was poor, in particular that of the IA-32 compatibility unit, which was buggy as well. SMP configurations (Symmetric Multi-Processing) were limited to four processors, and too few native Itanium applications were available.

This forced HP to extend its PA-RISC roadmap with an extra generation (8900 series), and forced SGI to add another two generations to its MIPS roadmap.

Sun cancelled its Solaris port for Itanium. And with Linux emerging, IBM, SCO, and Sequent cancelled their joint effort to combine their respective AIX, UnixWare, and PTX UNIXes into a single operating system (Monterey-64).

Finally, after AMD introduced 64 bit extensions (AMD64) to its processors, Intel was forced to follow with EM64T, pushing Itanium further up and away from the server market segment dominated by Intel Xeon and AMD Opteron processors.

Today's market

Nowadays, all HP engineers working on Itanium have been transferred to Intel.
Itanium is only sold in high-end server and High Performance Computing (HPC) markets.
Since IBM and Dell dropped Itanium servers from their product portfolios in 2005, the vast majority of Itanium systems are sold by HP. That makes Itanium mostly a replacement for HP's Alpha (DEC/Compaq) and PA-RISC processors, competing against IBM Power and Sun/Fujitsu UltraSparc. Other manufacturers selling Itanium systems are Bull, Fujitsu, Hitachi, NEC (together with Unisys), and SGI.

Today, Itanium is supported by Linux (Trillian, Red Hat, and SuSE), Compaq Tru64 Unix (HP), and HP-UX of course. Currently SGI is in a transition from its traditional IRIX on MIPS systems to Linux on Itanium.
Longhorn Server will support Itanium only for specific workloads: databases, custom jobs, and line-of-business applications.

In 2005, the companies currently manufacturing or selling Itanium hardware founded the Itanium Solutions Alliance, to promote the availability and acceptance of Itanium solutions in the market.


21.1  Intel Itanium "classic" processor

Compatibility


256 application registers:

  • 128 64 bit general purpose registers (integer and multimedia):
    32 general registers (GR0 - GR31): static, available to all programs,
    rest (GR32 - GR127) stacked: available per program,
    managed by Register Stack Engine (RSE) (stack pointer (SP): Current Frame Marker, CFM),
  • 128 82 bit floating point registers (FR0 - FR127):
    first 32 registers static: available to all programs,
    rest rotating: can be renamed to accelerate loops.

64 predicate registers (PR0 - PR63): contain predicate test (compare) results, for conditional execution of instructions,
first 16 registers static: available to all programs,
rest rotating: can be renamed to accelerate loops.

8 branch registers (BR0 - BR7).

128 application registers (AR0 - AR127): special-purpose data and control registers.

4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).

Bi-endian memory access: controlled by UM.be bit (User Mask, UM).

Memory mapped I/O.

Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment (Virtual Processor Descriptor, VPD).

IA-32 compatibility mode: IA-32 System Environment, i.e. Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE.
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rtfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.

PA-RISC supported through Aries emulator.

Operating system: supported through Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): firmware providing platform initialization, configuration, and test, operating system boot, run-time functionality (i.e. BIOS (Basic Input Output System), Machine Checks, and Platform Management Interruptions (PMI, successor to the IA-32 System Management Mode (SMM))).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.

Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.

Cache


On-die L1 cache (Harvard architecture):

  • 16 kbyte instruction cache (L1I):
    4-way set-associative, 32 byte line size,
    1 cycle latency,
  • 16 kbyte data cache (L1D):
    4-way set-associative, 32 byte line size,
    write-through, no write-allocate,
    2 cycles latency for integers, FPs bypass the L1 data cache.

On-die, unified L2 cache:
96 kbyte,
6-way set-associative, 64 byte line size,
write back,
6 cycles minimum latency for integers, 9 cycles minimum latency for FPs,
max. 2 requests per clock (banks).
Cache coherency through MESI protocol.

L3 cache:
2 or 4 Mbyte, apart in package, connected through Front Side Bus (FSB),
4-way set-associative, 64 byte line size,
21 cycles minimum latency for integers, 24 cycles minimum latency for FPs,
bandwidth 16 bytes per core cycle (64 bit DDR, 128 bit bus to core),
12 Gbyte/s max. throughput.
Cache coherency through MESI protocol.
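
The cache figures above determine the number of sets in each cache: size divided by (associativity times line size). A minimal sketch of that arithmetic follows; the function name and the choice of the 4 Mbyte L3 variant are illustrative assumptions, not part of the original listing.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a multiple of ways * line size")
    return sets


# (size, associativity, line size) as listed above for the Itanium "classic".
ITANIUM_CACHES = {
    "L1I": (16 * KB, 4, 32),
    "L1D": (16 * KB, 4, 32),
    "L2": (96 * KB, 6, 64),
    "L3 (4 Mbyte)": (4 * MB, 4, 64),
}

for name, (size, ways, line) in ITANIUM_CACHES.items():
    print(f"{name}: {cache_sets(size, ways, line)} sets")
```

The unusual 96 kbyte L2 size follows from its 6-way associativity: 6 ways of 64 byte lines still give a power-of-two number of sets.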

Translation Look-aside Buffer (TLB) and Virtual Hash Page Table (VHPT):

  • Instruction TLB (ITLB): between instruction fetch and decode,
    64 entry, fully associative,
  • two-level data TLB: between data caches and registers:
    • L1 DTLB (DTLB1):
      32 entry, fully associative,
      10 cycles penalty at miss,
    • L2 DTLB (DTLB2):
      96 entry, fully associative,
      page size 4 kbyte - 256 Mbyte supported.
Hardware Page Walker (HPW): loads VHPT from L2 cache / L3 cache / memory at TLB misses.

Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of speculative data loads,
32 entry, two-way set-associative.

Architecture


Double pipeline: 10 stage in-order, 6 instructions wide.
Split issue dispersal: three instructions (16 bytes) per bundle.
Scoreboarding, non-blocking caches (for compile-time non-determinism).
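
To illustrate the bundle format: a 16 byte bundle holds a 5 bit template plus three 41 bit instruction slots (5 + 3 x 41 = 128 bits). That field layout comes from the Itanium architecture definition rather than from the text above, and the helper names and sample values below are hypothetical. A minimal sketch:

```python
from typing import List, Tuple

TEMPLATE_BITS = 5   # bits 4:0 of the bundle
SLOT_BITS = 41      # three instruction slots of 41 bits each
SLOT_MASK = (1 << SLOT_BITS) - 1


def split_bundle(bundle: bytes) -> Tuple[int, List[int]]:
    """Split a 16 byte bundle into its template and three instruction slots."""
    if len(bundle) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(bundle, "little")
    template = value & ((1 << TEMPLATE_BITS) - 1)
    slots = [(value >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def join_bundle(template: int, slots: List[int]) -> bytes:
    """Inverse of split_bundle, used here only to build a sample value."""
    value = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return value.to_bytes(16, "little")


# Hypothetical bundle contents, chosen only to show the round trip.
raw = join_bundle(0x10, [0x1, 0x2, 0x3])
print(split_bundle(raw))
```

The template tells the hardware which unit type each slot needs, which is how the compiler communicates instruction-level parallelism to the issue logic.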

17 execution units:

  • 4 Integer units (ALU, Arithmetic Logic Unit),
  • 4 Multimedia units (ALU),
  • 2 Extended Precision Floating Point (FP) units: F0, F1,
    ANSI/IEEE-754,
    FMAC: Floating Point Multiply Add Calculation: multiply and add of 82 bit floating point values in one cycle (for matrix calculations),
  • 2 Single Precision FPUs,
    each executing two calculations per clock,
  • all FP units together delivering max. throughput of 6.4 GFLOPS,
  • 2 load/store units: M0, M1,
  • 3 branch units: B0, B1, B2.

9 issue ports:

  • 2 memory: M0, M1,
  • 2 integer: I0, I1,
  • 2 FP: F0, F1,
  • 3 branch: B0, B1, B2,
serving the 17 execution units above.

Dynamic prefetch, branch prediction, speculative execution.

Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.

Interval Time Counter (ITC): register for timing ticks.
In 32 bit compatibility mode: Time Stamp Counter (TSC).

Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:

  • handled by operating system: to Interrupt Vector Address (IVA) through Interrupt Vector Table (IVT),
  • handled by PAL firmware.
Interruption Status Register (ISR).
256 interrupt vectors:
  • 0 - 15: special, high priority,
  • 16 - 255: freely assignable.
Support for Intel 8259A interrupt controllers.

Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 61 bit Virtual Regions (Virtual Region Number, VRN; Region Identifier, RID), 2^24 Virtual Address Spaces of 2^61 bytes.
4 kbyte - 256 Mbyte pages (Virtual Page Number, VPN).
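
As a sketch of the region scheme described above: with 8 regions of 61 bits each, the top 3 bits of a 64 bit virtual address form the Virtual Region Number and the remaining 61 bits are the offset within that region. The function name and the sample address are illustrative assumptions.

```python
from typing import Tuple

REGION_BITS = 3     # 8 virtual regions
OFFSET_BITS = 61    # 61 bit offset within a region


def split_virtual_address(va: int) -> Tuple[int, int]:
    """Return (VRN, offset) for a 64 bit virtual address."""
    if not 0 <= va < 1 << (REGION_BITS + OFFSET_BITS):
        raise ValueError("virtual address must fit in 64 bits")
    vrn = va >> OFFSET_BITS
    offset = va & ((1 << OFFSET_BITS) - 1)
    return vrn, offset


# Hypothetical address in the highest region.
vrn, offset = split_virtual_address(0xE000000000001234)
print(f"VRN={vrn}, offset=0x{offset:X}")
```

The VRN selects one of eight region registers, each holding the Region Identifier of the address space currently mapped into that region.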

Physical address space: 63 bit.
Up to 50 bits supported in page tables.

Write Coalescing (WC): streams of non-cachable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 64 byte.

Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.

44 bit address bus.
Physical addressing:

  • 32 bit: 0-4 Gbyte,
  • 36 bit: 4-64 Gbyte,
  • 44 bit: 64 Gbyte - 16 Tbyte.
Virtual addressing: 54 bit.
Page sizes: 4 kbyte - 256 Mbyte.

133 MHz DDR bus (Merced bus): 64 bit data.
Source Synchronous Signaling (SSS).
2.1 Gbyte/s max. throughput.

Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.

Power pod connector.

Tests:

  • Built-In Self Test (BIST),
  • Test Access Port (TAP): IEEE 1149.1 (JTAG),
  • In-Target Probe (ITP): debugging interface for board integration,
    JTAG TAP, access to registers, memory, and I/O,
    ITP700 Debug Port (DB): command and control interface for ITP,
    max. 16 MHz,
  • Logic Analyzer Interface (LAI),
  • code debugging:
    Instruction and Data Breakpoint Registers (IBR, DBR),
    single stepping (through PSR.ss),
    breaks, taken branches (through PSR.tb), privileges,
    instruction and data debugging.

Processor performance monitoring and profiling:

  • Performance Monitor Configuration (PMC),
  • Performance Monitor Data Registers (PMD),
  • 4 32 bit performance counters.
Dynamic processor behaviour (instruction execution, caches, branch prediction, virtual memory translation) can be monitored with real-world operating systems, applications, and systems, and be fed back into the code generation process.

Multi-processing


SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Shared memory, cache coherency through MESI protocol.

Multiplier


Multiplier (Phase Lock Loop, PLL):
set through pins during reset:

multiplier   LINT[1]   LINT[0]   IGNNE#   A20M#
2/11         0         0         0        0
2/12         0         1         1        1

Power management


Power and performance management:
P-states:

  • P0: maximum performance, maximum power (highest utilization),
  • P15 (lowest utilization),
set for all logical processors (multi-threading, multiple cores), per dependency domain (depending on distribution network for clock and power),
managed by PAL.

Performance


Performance:

  • slow integer (comparable to X86 processors),
  • very fast FP units,
  • very slow IA-32 unit,
  • L3 cache slow due to small memory bus,
resulting in poor over-all performance.

Thermal management


Thermal management: via on-die thermal diode:

  • Thermal Alert: thresholds set through SMBus,
    THRMALERT# pin active when threshold crossed,
  • no Enhanced Thermal Management (ETM),
  • Thermal Trip: processor shutdown when overheated,
    THRMTRIP# pin active, reset processor to resume.

System management


System management: System Management Bus (SMBus):

  • Processor Information EEPROM (PIROM): manufacturing and features information,
    permanently write-protected:
    • processor: s-spec / QDF number, sample/production,
    • core: architecture revision, family, model, stepping/revision,
      maximum core frequency, maximum bus frequency, voltage, voltage tolerance,
    • L3 cache: size, voltage, tolerance, stepping,
    • package: cartridge revision, substrate revision,
    • part numbers: processor part number (80542KC),
      processor electronic signature (64 bit serial number),
    • thermal reference,
    • features, IA-32 features, cartridge features,
  • scratch EEPROM: for OEM system designer information,
  • thermal sensing device (A/D converter), connected to on-die thermal diode.
3.3 V ± 5% (3.14-3.47 V).

Marking


Marking:

  • Intel brand,
  • legal mark,
  • product ID,
  • Finish Process Order (FPO),
  • serial number,
  • s-spec,
  • country of origin (not for 9000 series),
  • Assembly Process Order (APO).

CPUID


CPUID: 8 byte registers:

  • registers 0-4: fixed region,
  • registers 5 and further: variable region.

  • registers 0 and 1: vendor id information,
  • register 2: ignored,
  • register 3: processor implementation information:
    • bits 7:0: largest CPUID register number,
    • bits 15:8: processor revision number,
    • bits 23:16: processor model number,
    • bits 31:24: processor family number (0x07),
    • bits 39:32: processor architecture revision number (0x00),
    • bits 63:40: reserved,
  • register 4: processor features:
    • bit 0: long branch instruction (brl) implemented, no need to emulate by operating system,
    • bit 1: spontaneous deferral implemented,
    • bit 2: 16-byte atomic operations implemented,
    • bits 63:3: reserved.
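
The bit fields of CPUID registers 3 and 4 listed above can be unpacked with shifts and masks. A minimal sketch; the function names, dictionary keys, and the sample register values are assumptions made for illustration, not values read from real hardware.

```python
def decode_cpuid3(value: int) -> dict:
    """Unpack CPUID register 3 (processor implementation information)."""
    return {
        "largest_register": value & 0xFF,        # bits 7:0
        "revision": (value >> 8) & 0xFF,         # bits 15:8
        "model": (value >> 16) & 0xFF,           # bits 23:16
        "family": (value >> 24) & 0xFF,          # bits 31:24
        "arch_revision": (value >> 32) & 0xFF,   # bits 39:32
    }


def decode_cpuid4(value: int) -> dict:
    """Unpack CPUID register 4 (processor features)."""
    return {
        "long_branch": bool(value & 0x1),           # bit 0: brl implemented
        "spontaneous_deferral": bool(value & 0x2),  # bit 1
        "atomic_16_byte": bool(value & 0x4),        # bit 2
    }


# Hypothetical register contents: family 0x07, model 0, revision 0,
# largest CPUID register number 4, no feature bits set.
print(decode_cpuid3(0x0000000007000004))
print(decode_cpuid4(0x0))
```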

CPUID return values:

0x10 L1D: 16 kbyte, 4-way set-associative, 32 byte line size
0x15 L1I: 16 kbyte, 4-way set-associative, 32 byte line size
0x1A L2: 96 kbyte, 6-way set-associative, 64 byte line size
0x88 L3: 2 Mbyte, 4-way set-associative, 64 byte line size
0x89 L3: 4 Mbyte, 4-way set-associative, 64 byte line size
0x8A L3: 8 Mbyte, 4-way set-associative, 64 byte line size
0x90 ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x96 DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x9B DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages

With the EAX register set to 2, the IA-32 CPUID instruction returns these descriptors in the EAX, EBX, ECX, and EDX registers (bytes listed MSB - LSB):

EAX 0x00 0x15 0x10 0x00
EBX 0x00 0x00 0x88/0x89 0x00
ECX 0x00 0x9B 0x00 0x00
EDX 0x80 0x00 0x00 0x00
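
The register dump above can be unpacked byte by byte into descriptors. The sketch below assumes the usual IA-32 conventions for this CPUID leaf: the low byte of EAX is a call count rather than a descriptor, a register with bit 31 set carries no valid descriptors, and 0x00 is a null descriptor. The function name is hypothetical, the dump uses the 0x88 variant of EBX, and the lookup table holds only the descriptors that occur in it.

```python
from typing import Dict, List

# Descriptor meanings copied from the list above (only those in the dump).
DESCRIPTORS = {
    0x10: "L1D: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x15: "L1I: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x88: "L3: 2 Mbyte, 4-way set-associative, 64 byte line size",
    0x9B: "DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages",
}


def descriptor_bytes(regs: Dict[str, int]) -> List[int]:
    """Collect the non-null descriptor bytes from EAX..EDX, LSB first."""
    found = []
    for name in ("EAX", "EBX", "ECX", "EDX"):
        value = regs[name]
        if value & 0x80000000:      # bit 31 set: no valid descriptors here
            continue
        for shift in (0, 8, 16, 24):
            if name == "EAX" and shift == 0:
                continue            # AL is the call count, not a descriptor
            byte = (value >> shift) & 0xFF
            if byte:                # 0x00 is the null descriptor
                found.append(byte)
    return found


DUMP = {"EAX": 0x00151000, "EBX": 0x00008800,
        "ECX": 0x009B0000, "EDX": 0x80000000}

for d in descriptor_bytes(DUMP):
    print(f"0x{d:02X}: {DESCRIPTORS.get(d, 'unknown descriptor')}")
```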

Market


Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).

Only a few thousand delivered.
Succeeded by Itanium 2 in 2002.


21.1.1  Intel Itanium processor (Merced)

21.2  Intel Itanium 2 processor

Successor of Itanium processor.
Major revision:

  • improvements in number of pipeline units, number of clock cycles,
  • according to Intel and HP one-and-a-half to two times faster than the Itanium processor,
  • brl instruction: 64 bit long branch,
  • scoreboarding of multi-cycle instructions, e.g. L1D misses, multimedia, FP.

Compatibility


256 application registers:

  • 128 64 bit general purpose registers (integer and multimedia):
    32 general registers (GR0 - GR31): static, available to all programs,
    rest (GR32 - GR127) stacked: available per program,
    managed by Register Stack Engine (RSE) (stack pointer (SP): Current Frame Marker, CFM),
  • 128 82 bit floating point registers (FR0 - FR127):
    first 32 registers static: available to all programs,
    rest rotating: can be renamed to accelerate loops.

64 predicate registers (PR0 - PR63): contain predicate test (compare) results, for conditional execution of instructions,
first 16 registers static: available to all programs,
rest rotating: can be renamed to accelerate loops.

8 branch registers (BR0 - BR7).

128 application registers (AR0 - AR127): special-purpose data and control registers.

4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).

Bi-endian memory access: controlled by UM.be bit (User Mask, UM).

Memory mapped I/O.

Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment.
From Montecito: Intel VT for Itanium (VT-i),
Virtual Processor Descriptor (VPD): description of resources of a single virtual processor,
rest of Virtual Processor State (VPS) maintained by VMM.

IA-32 compatibility mode: IA-32 System Environment, i.e. Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE, SSE2 (from Intel Itanium 2 9000 series processor).
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rtfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.
From Madison, IA-32 support implemented in software, as part of operating system (IA-32 Execution Layer, EL),
IA-32 EL provided by Intel for Linux and Windows,
erratum: segmentation not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
erratum: 16 bit application mode not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
note: CPUID returns only manufacturer and family of emulated processor model.

PA-RISC supported through Aries emulator.

Operating system: supported through Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): firmware providing platform initialization, configuration, and test, operating system boot, run-time functionality (i.e. BIOS (Basic Input Output System), Machine Checks, and Platform Management Interruptions (PMI, successor to the IA-32 System Management Mode (SMM))).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.

Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.

Cache


On-die L1 cache (Harvard architecture):

  • 16 kbyte instruction cache (L1I):
    4-way set-associative, 64 byte line size,
    1 cycle latency,
    32 Gbyte/s max. reading speed.
  • 16 kbyte data cache (L1D):
    4-way set-associative, 64 byte line size,
    write-through, no write-allocate,
    1 cycle latency for integers, FPs and semaphores bypass the L1 data cache,
    32 Gbyte/s max. reading speed, 16 Gbyte/s max. writing speed.

On-die, unified L2 cache:
256 kbyte,
8-way set-associative, 128 byte line size,
write back, write-allocate,
non-blocking, out-of-order,
5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
16 byte banks,
32 Gbyte/s max. reading speed.
Cache coherency through MESI protocol.

From Itanium 2 9000 series processors: on-die L2 cache (Harvard architecture):

  • 1 Mbyte instruction cache (L2I):
    8-way set-associative, 128 byte line size,
    7 cycle latency,
  • 256 kbyte data cache (L2D):
    8-way set-associative, 128 byte line size,
    write-back, write-allocate,
    non-blocking, out-of-order,
    5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
    16 byte banks,
    32 Gbyte/s max. reading speed.
Cache coherency through MESI protocol.
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).

On-die, unified L3 cache:
up to 2x 12 = 24 Mbyte,
McKinley and Madison: 4-way set-associative per Mbyte, Madison 9M: 2-way set-associative per Mbyte; 128 byte line size,
fully pipelined, non-blocking,
McKinley: 12 cycles minimum latency for integers, 13 cycles minimum latency for FPs; Madison and Madison 9M: 14 cycles minimum latency for integers, 15 cycles minimum latency for FPs; Montecito: 14 cycles latency,
bandwidth to core 32 bytes per core cycle (256 bit bus to core),
6.2 Gbyte/s max. traffic speed to memory,
providing data to core at up to 48 Gbyte/s.
Cache coherency through MESI protocol,
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).

Translation Look-aside Buffer (TLB) and Virtual Hash Page Table (VHPT):

  • Two-level instruction TLB (ITLB): between instruction fetch and decode:
    • L1 ITLB (ITLB1):
      64 entry, fully associative,
      only page size of 4 kbyte supported,
      2 cycles latency,
    • L2 ITLB (ITLB2):
      128 entry, fully associative,
      page size 4 kbyte - 256 Mbyte supported.
  • two-level data TLB: between data caches and registers:
    • L1 DTLB (DTLB1):
      32 entry, fully associative,
      2 cycles penalty at miss,
      only page size of 4 kbyte supported,
    • L2 DTLB (DTLB2):
      128 entry, fully associative,
      page size 4 kbyte - 4 Gbyte supported.
Hardware Page Walker (HPW): loads VHPT from L2 cache / L3 cache / memory at TLB misses.

Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of speculative data loads,
32 entry, fully associative.

Architecture


Double pipeline: 8 stage in-order, 6 instructions wide.
Split issue dispersal: three instructions (16 bytes) per bundle.
Scoreboarding, non-blocking caches (for compile-time non-determinism).

Execution units, all fully pipelined:

  • 6 Integer units (ALU, Arithmetic Logic Unit): ALU0-5,
    1 cycle latency,
  • 6 Multimedia units (ALU): PALU0-5 (compare HP MAX-2, Intel MMX and SSE),
    2 cycles latency,
  • 1 SHIFT unit: ISHIFT,
  • 2 parallel shift units: PSMU0, PSMU1,
  • 1 parallel multiply unit: PMUL,
    executing 1 SIMD FP operation per cycle,
  • 1 population count unit (for popcnt instruction): POPCNT,
    only a single issue port for PMUL and POPCNT,
  • 2 Extended Precision Floating Point (FP) units: FMAC0, FMAC1,
    ANSI/IEEE-754,
    FMAC: Floating Point Multiply Add Calculation: multiply and add of 82 bit floating point values in one cycle (for matrix calculations),
    4 cycles latency,
  • 2 FPUs for other FP operations: FMISC0, FMISC1,
    4 cycles latency,
  • 4 memory ports in Data Cache Unit (DCU): 2 load units, 2 store units,
  • 3 branch units: B0, B1, B2.

11 issue ports:

  • 4 memory/ALU/multimedia: M0, M1, M2, M3,
  • 2 integer/ALU/multimedia: I0, I1,
  • 2 FP: F0, F1,
  • 3 branch: B0, B1, B2,
serving the execution units above.

Dynamic prefetch, optimized branch prediction, speculative execution.

Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.

Interval Time Counter (ITC): register for timing ticks.
In 32 bit compatibility mode: Time Stamp Counter (TSC).

Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:

  • handled by operating system: to Interrupt Vector Address (IVA) through Interrupt Vector Table (IVT),
  • handled by PAL firmware.
Interruption Status Register (ISR).
256 interrupt vectors:
  • 0 - 15: special, high priority,
  • 16 - 255: freely assignable.
Support for Intel 8259A interrupt controllers.

Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 61 bit Virtual Regions (Virtual Region Number, VRN; Region Identifier, RID), 2^24 Virtual Address Spaces of 2^61 bytes.
4 kbyte - 4 Gbyte pages (Virtual Page Number, VPN).

Physical address space: 63 bit.
Up to 50 bits supported in page tables.

Write Coalescing (WC): streams of non-cachable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 128 byte.

Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.

50 bit address bus.
Physical addressing:

  • 32 bit: 0-4 Gbyte,
  • 36 bit: 4-64 Gbyte,
  • 44 bit: 64 Gbyte - 16 Tbyte.
Virtual addressing: 54 bit.
Page sizes: 4 kbyte - 4 Gbyte.

200/266/333 MHz DDR bus (McKinley bus, Scalability Port): 128 bit data.
Source Synchronous Signaling (SSS).
6.4/8.5/10.6 Gbyte/s max. throughput.
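
The throughput figures follow from the bus parameters: clock rate times two transfers per clock (DDR) times the bus width in bytes. A minimal sketch of that arithmetic, assuming 1 Gbyte = 10^9 bytes and the nominal clock values as written; the function name is hypothetical.

```python
def bus_throughput_gbyte_s(clock_mhz: float, width_bits: int,
                           transfers_per_clock: int = 2) -> float:
    """Peak bus throughput in Gbyte/s (10^9 bytes per second)."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# 128 bit data bus at the three listed clock rates.
for clock in (200, 266, 333):
    print(f"{clock} MHz: {bus_throughput_gbyte_s(clock, 128):.2f} Gbyte/s")
```

The same function applied to the 133 MHz, 64 bit bus of the first Itanium gives roughly 2.1 Gbyte/s, so the wider and faster bus of the Itanium 2 triples to quintuples the peak bandwidth.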

Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.

Power pod connector.

Tests:

  • Built-In Self Test (BIST),
  • Test Access Port (TAP): IEEE 1149.1 (JTAG),
  • In-Target Probe (ITP): debugging interface for board integration,
    JTAG TAP, access to registers, memory, and I/O,
    ITP700 Debug Port (DB): command and control interface for ITP,
    max. 16 MHz,
  • Logic Analyzer Interface (LAI),
  • code debugging:
    Instruction and Data Breakpoint Registers (IBR, DBR),
    single stepping (through PSR.ss),
    breaks, taken branches (through PSR.tb), privileges,
    instruction and data debugging.

Processor performance monitoring and profiling:

  • Performance Monitor Configuration (PMC),
  • Performance Monitor Data Registers (PMD),
  • 4 48 bit performance counters,
    Montecito: 12 48 bit performance counters per thread.
Dynamic processor behaviour (instruction execution, caches, branch prediction, virtual memory translation) can be monitored with real-world operating systems, applications, and systems, and be fed back into the code generation process.

Multi-processing


Hyper-Threading Technology (HTT) (from Montecito),
Temporal Multi-Threading (TMT; Switch-on-Event Multi-Threading, SoEMT): threads not running simultaneously, core switches in case of high-latency event.

SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Max. four processors at 200 MHz, max. two processors at 266 or 333 MHz.
Shared memory, cache coherency through MESI protocol.

Multiplier


Multiplier (Phase Lock Loop, PLL):
set through pins during reset:

multiplier\pin A21# - A17#
2/9 10110
2/10 10101
2/13 10010
2/14 10001
2/15 10000
2/16 01111
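
The table above can be read as a lookup from the pin pattern to a bus-to-core clock ratio. The sketch below assumes that "2/9" means two bus clocks per nine core clocks, so that the core frequency is the bus clock times 9/2; that reading, the function name, and the 200 MHz bus clock used in the example are assumptions.

```python
# Pin pattern A21# - A17# mapped to (bus clocks, core clocks), from the table above.
PIN_TO_RATIO = {
    "10110": (2, 9),
    "10101": (2, 10),
    "10010": (2, 13),
    "10001": (2, 14),
    "10000": (2, 15),
    "01111": (2, 16),
}


def core_clock_mhz(pins: str, bus_mhz: float) -> float:
    """Core frequency for a pin pattern, assuming ratio = bus clocks / core clocks."""
    try:
        bus_clocks, core_clocks = PIN_TO_RATIO[pins]
    except KeyError:
        raise ValueError(f"pin pattern {pins!r} is not in the table") from None
    return bus_mhz * core_clocks / bus_clocks


for pins in PIN_TO_RATIO:
    print(f"{pins}: {core_clock_mhz(pins, 200):.0f} MHz core at 200 MHz bus")
```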

Power management


Power and performance management:
P-states:

  • P0: maximum performance, maximum power (highest utilization),
  • P15 (lowest utilization),
set for all logical processors (multi-threading, multiple cores), per dependency domain (depending on distribution network for clock and power),
managed by PAL.

Performance


Performance:

  • improved integer performance,
  • very fast FP units,
  • IA-32 compatibility by emulator: comparable to X86 processors.

Thermal management


Thermal management: via on-die thermal diode:

  • Thermal Alert: thresholds set through SMBus,
    THRMALERT# pin active when threshold crossed,
  • Enhanced Thermal Management (ETM): thresholds set through SMBus,
    when maximum exceeded, entering low power mode and Correctable Machine Check Interrupt (CMCI),
    when within normal range again, after one second back to normal mode and another CMCI,
  • Thermal Trip: processor shutdown when overheated,
    THRMTRIP# pin active, reset processor to resume.

System management


System management: System Management Bus (SMBus):

  • Processor Information EEPROM (PIROM): manufacturing and features information,
    permanently write-protected:
    • processor: s-spec / QDF number, sample/production,
    • core: architecture revision, family, model, stepping/revision,
      maximum core frequency, maximum bus frequency, voltage, voltage tolerance,
    • L3 cache: size, voltage, tolerance, stepping,
    • package: cartridge revision, substrate revision,
    • part numbers: processor part number (McKinley: 80542KC; Madison, Madison 9M, and Fanwood: 80543KC; Deerfield and Fanwood LV: 80544KC; Fanwood @ 266 MHz: 80533KE; 9000 series: 80549KC),
      processor electronic signature (64 bit serial number),
    • thermal reference (upper limits: Madison @ 1.3 GHz: 107 °C; Madison @ 1.4/1.5 GHz: 105 °C; Madison @ 1.6 GHz: 113 °C; Madison 9M: 113 °C; Fanwood and Fanwood LV: 105 °C; 9000 series: 92 °C),
    • features, IA-32 features, cartridge features,
  • scratch EEPROM: for OEM system designer information,
  • thermal sensing device (A/D converter), connected to on-die thermal diode.
3.3 V ± 5% (3.14-3.47 V).

Marking


Marking:

  • Intel brand,
  • legal mark,
  • product ID,
  • Finish Process Order (FPO),
  • serial number,
  • s-spec,
  • country of origin (not for 9000 series),
  • Assembly Process Order (APO),
  • 2D matrix mark (not for McKinley).

CPUID


CPUID: 8 byte registers:

  • registers 0-4: fixed region,
  • registers 5 and further: variable region.

  • registers 0 and 1: vendor id information,
  • register 2: ignored,
  • register 3: processor implementation information:
    • bits 7:0: largest CPUID register number,
    • bits 15:8: processor revision number,
    • bits 23:16: processor model number,
    • bits 31:24: processor family number (McKinley, Madison, and Madison 9M: 0x1F; Montecito: 0x20),
    • bits 39:32: processor architecture revision number (McKinley: 0x00; Madison: 0x01; Madison 9M: 0x02; Montecito: 0x00),
    • bits 63:40: reserved,
  • register 4: processor features:
    • bit 0: long branch instruction (brl) implemented, no need to emulate by operating system (from McKinley),
    • bit 1: spontaneous deferral implemented,
    • bit 2: 16-byte atomic operations implemented,
    • bits 63:3: reserved.

Family number Model number Processor
0x07 0x00 Itanium Merced
0x1F 0x00 Itanium 2 McKinley
0x1F 0x01 Itanium 2 Madison, Deerfield, Hondo
0x1F 0x02 Itanium 2 Madison 9M, Fanwood
0x20 0x00 Itanium 2 9000 series Montecito, Millington
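
The family/model table above lends itself to a simple lookup keyed on the two fields of CPUID register 3 (family in bits 31:24, model in bits 23:16). A minimal sketch; the function name and the sample register values are hypothetical.

```python
# (family, model) mapped to processor name, copied from the table above.
PROCESSORS = {
    (0x07, 0x00): "Itanium Merced",
    (0x1F, 0x00): "Itanium 2 McKinley",
    (0x1F, 0x01): "Itanium 2 Madison, Deerfield, Hondo",
    (0x1F, 0x02): "Itanium 2 Madison 9M, Fanwood",
    (0x20, 0x00): "Itanium 2 9000 series Montecito, Millington",
}


def identify(cpuid3: int) -> str:
    """Name the processor from the family and model fields of CPUID register 3."""
    family = (cpuid3 >> 24) & 0xFF
    model = (cpuid3 >> 16) & 0xFF
    return PROCESSORS.get((family, model), "unknown processor")


# Hypothetical register values, with 4 as the largest CPUID register number.
print(identify(0x1F010004))
print(identify(0x20000004))
```

Several code names share one family/model pair, so this lookup cannot tell a Madison from a Deerfield; that distinction needs other information, such as the part number in the PIROM.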

CPUID return values:

0x10 L1D: 16 kbyte, 4-way set-associative, 32 byte line size
0x15 L1I: 16 kbyte, 4-way set-associative, 32 byte line size
0x1A L2: 96 kbyte, 6-way set-associative, 64 byte line size
0x88 L3: 2 Mbyte, 4-way set-associative, 64 byte line size
0x89 L3: 4 Mbyte, 4-way set-associative, 64 byte line size
0x8A L3: 8 Mbyte, 4-way set-associative, 64 byte line size
0x90 ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x96 DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x9B DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages

With the EAX register set to 2, the IA-32 CPUID instruction returns these descriptors in the EAX, EBX, ECX, and EDX registers (bytes listed MSB - LSB):

EAX 0x00 0x15 0x10 0x00
EBX 0x00 0x00 0x88/0x89 0x00
ECX 0x00 0x9B 0x00 0x00
EDX 0x80 0x00 0x00 0x00

IA-32 CPUID cache return values:

0x67 L1D: 64 kbyte, 4-way set-associative, 64 byte line size
0x77 L1I: 64 kbyte, 4-way set-associative, 64 byte line size
0x7E L2: 256 kbyte, 8-way set-associative, 128 byte line size
0x8D L3: 3 Mbyte, 12-way set-associative, 128 byte line size

Market


Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).


21.2.1  Intel Itanium 2 processor (McKinley)

21.2.2  Intel Itanium 2 processor (Madison)

21.2.3  Intel Itanium 2 DP LV processor (Deerfield)

21.2.4  HP Itanium 2 mx2 processor module (Hondo)

21.2.5  Intel Itanium 2 processor (Madison 9M)

21.2.6  Intel Itanium 2 DP processor (Fanwood)

21.2.7  Intel Itanium 2 DP LV processor (Fanwood LV)

21.2.8  Intel Itanium 2 9000 series Dual-Core processor (Montecito)

21.2.9  Intel Itanium 2 DP 9000 series Dual-Core processor (Millington)

21.2.10  Intel Itanium 2 DP LV 9000 series Dual-Core processor (Millington LV)
