Part 1: AMD Athlon XP 2000+ Review
By Van Smith
Date: January 7, 2002
A savage war between Intel and Advanced Micro Devices has led to a windfall for consumers as prices have fallen dramatically, while processing power continues to increase. In this battle between the top two MPU manufacturers, AMD (nyse: AMD) has decidedly had the upper hand ever since the introduction of the Athlon XP in early October. The AMD Athlon XP line of microprocessors dominates the fastest Intel (nasdaq: INTC) Pentium 4s, yet costs much less money.
Today, both Intel and AMD launch yet more salvos at each other as the rival chipmakers announce new chips. Intel debuts a new Pentium 4 codenamed “Northwood.” Based on a young 0.13-micron process, Northwood features a 512kB level 2 cache, which doubles the older Willamette Pentium 4’s complement. Aside from this one enhancement, Northwood is functionally identical to its predecessor.
Rival AMD cranks up the clock speed with the release of its Athlon XP 2000+. Running at 1667MHz, the new AMD chip is a simple bump forwards for the successful XP core.
Quantispeed Versus Netburst
The contrast of Athlon XP against Pentium 4 is one of efficiency versus clock speed. Even the Pentium 4’s strongest proponents have to yield to claims that the Athlon XP will significantly outperform similarly clocked Pentium 4s. AMD pundits, in turn, have to bow to the greater clock speeds of Intel chips.
Of course, what really matters is real world performance. The Pentium 4 embodies the school of thought claiming that higher clock speed is a panacea for wringing more performance out of a given fabrication process technology. The key to reaching higher clock speeds is extending a processor’s “pipeline.”
Deeper Pipeline = Higher Clock Speeds * Lower Efficiencies
Imagine a “sandwich CPU” which has instructions to make ham sandwiches, BLTs, etc. To accomplish an instruction in one clock tick, a large number of steps would have to be hardwired so that the chip can start from nothing and end up with a sandwich in a single clock cycle. Visualize a number of dedicated robotic arms, one handling bread, another delivering meat and so forth. These hands slam together a sandwich simultaneously in one Byzantine operation. Although this is complicated to implement, if you ask for a roast beef sandwich, you will get it one tick later.
Alternatively, the steps to making the sandwich could be broken out so that producing a sandwich takes, say, 20 ticks. At one step the bread is positioned, at the next, condiments are applied, then a slice of meat is added at the following station, etc. Now a sandwich won’t pop out immediately like in the first design, but implementing each step in assembly line fashion will be much easier to do. Because of this, the hypothetical sandwich CPU should be able to run faster because it doesn’t have to perform much work in each "tick" or step.
Furthermore, we can make each step in the assembly line flexible enough to perform analogous work on any sandwich requested. Set up this way, after the first sandwich pops out 20 cycles after it is requested, subsequent sandwiches will roll off the line every clock tick. So after the initial delay, or latency, this sandwich processor is as fast as the original processor.
Because the second processor cannot know what the orders will be 20 steps beforehand, it guesses them to keep this assembly line, or “pipeline,” fed. Modern processors are pretty good at making these kinds of guesses. However, a twenty cycle penalty will be exacted whenever our processor is wrong, or “mispredicts” a series of orders, and the pipeline needs to be restarted or “flushed.” As in real life, this happens often enough to worry about.
Pitted against our original processor, our pipelined processor will always be slower as long as it runs at the same speed, but hopefully the pipelining will allow our second chip to run faster.
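The sandwich analogy can be put into numbers with a minimal sketch. The function names below are hypothetical, and the model is deliberately idealized: it assumes one order enters the line per tick and that each misprediction costs exactly one full pipeline depth, as described above.

```python
def unpipelined_cycles(n_orders: int) -> int:
    """Single-cycle design: every order is finished one tick after it is issued."""
    return n_orders


def pipelined_cycles(n_orders: int, depth: int = 20, mispredicts: int = 0) -> int:
    """Assembly-line design.

    The first order needs `depth` ticks to emerge, each later order follows
    one tick behind it, and every misprediction flushes the line and costs
    another `depth` ticks to refill it.
    """
    if n_orders == 0:
        return 0
    fill_latency = depth              # initial delay before the first sandwich
    steady_state = n_orders - 1       # one sandwich per tick afterwards
    flush_penalty = mispredicts * depth
    return fill_latency + steady_state + flush_penalty


if __name__ == "__main__":
    print(unpipelined_cycles(1000))                # 1000 ticks
    print(pipelined_cycles(1000))                  # 1019 ticks, no mispredictions
    print(pipelined_cycles(1000, mispredicts=10))  # 1219 ticks with ten flushes
```

In tick counts the pipelined design never wins; it can only come out ahead if each of its ticks is shorter, that is, if it runs at a higher clock speed.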
Let’s now mercifully break from our sandwich processor analogy.
The Athlon XP has a ten stage pipeline while the Pentium 4 has twenty pipeline stages. Largely because of its deeper pipeline, the P4 can reach higher clock speeds. However, missed predictions that cause pipeline “stalls” also cause a very significant twenty cycle delay in the Pentium 4 versus only a ten cycle delay in the Athlon XP.
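As a rough illustration of what the deeper pipeline costs, the sketch below estimates how much faster a twenty-stage chip would have to be clocked merely to offset its larger flush penalty. The misprediction rate is a made-up input, not a measured figure, and the model assumes an otherwise ideal one-instruction-per-clock machine, so it ignores every other difference between the two cores.

```python
def cycles_per_instruction(mispredict_rate: float, penalty_cycles: int) -> float:
    """Average cycles per instruction when each flush costs `penalty_cycles`.

    `mispredict_rate` is flushes per instruction (a hypothetical input).
    """
    return 1.0 + mispredict_rate * penalty_cycles


def break_even_clock_ratio(mispredict_rate: float,
                           deep_penalty: int = 20,
                           shallow_penalty: int = 10) -> float:
    """Clock ratio (deep / shallow) at which both pipelines finish together."""
    deep = cycles_per_instruction(mispredict_rate, deep_penalty)
    shallow = cycles_per_instruction(mispredict_rate, shallow_penalty)
    return deep / shallow


if __name__ == "__main__":
    # Assumed rates: one flush per 200, 100, and 50 instructions.
    for rate in (0.005, 0.01, 0.02):
        ratio = break_even_clock_ratio(rate)
        print(f"flush rate {rate:.3f}: deep pipeline needs {ratio:.3f}x the clock")
```

Under these assumptions the required clock advantage grows with the flush rate, which is why branch-heavy code tends to be the least kind to very deep pipelines.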
Deeper Pipeline = Greater Complexity and Design Compromises
What is worse for the Pentium 4 is that implementing all of its pipeline steps, or “stages,” requires a greater degree of design complexity making the chip big and power hungry. Both of these characteristics undermine attaining higher clock speeds, or “clock speed ramping.” Because the P4 grew so large, the number of functional units had to be reduced, which further handicaps it against the Athlon XP.
Higher Clock Speeds = More Thermal Issues
Another phenomenon that becomes a growing concern in deeply pipelined designs like the Pentium 4 is localized overheating. Silicon is not a good conductor of heat and across a sprawling die one part of the CPU could be working furiously and getting hot while other regions of the chip might be idle and cool. Because the P4 has to run at high clock speeds to remain competitive with designs like the Athlon XP, an extreme temperature gradient might develop so that part of the chip could be destroyed.
Intel has attempted to combat this severe problem primarily through two means. The first is called “clock gating,” where idle portions of the P4 are shut down; therefore, in many instances, this reduces the total amount of heat being generated. Although this can make the chip cooler overall, temperature gradients across the die can become even more extreme.
Intel’s second measure was to implement the controversial Thermal Monitor. This mechanism utilizes a thermal sensor embedded in a region of the Pentium 4 die most susceptible to localized overheating. In other words, the probe is located in a “hot spot.” The problem is that there are several hot spots on the P4 die.
Thermal Issues Gate Performance
For the Pentium 4, unlike the Pentium III and the mechanisms being rolled out with the Athlon XP, relying on a failsafe shutdown is dicey. By the time the Thermal Monitor registers a certain temperature, let's say 70 degrees Celsius, a distant hot spot could already be at dangerously high temperatures. On the other hand, perhaps merely the area immediately around the Thermal Monitor is at 70 degrees, which is no reason for concern.
However, the Thermal Monitor cannot know which situation exists. What Intel has decided to do under this circumstance is to immediately halt and restart the P4 in cycles of about 2 microseconds. This speed reduction is called “throttling.” A 50% duty cycle is imposed by default, but this duty cycle can be overridden through software or chipset control so that it is issued in 12.5% increments until the chip cools down. If the chip continues to heat up, at a predefined point a legacy PIII-type mechanism will force the chip to shut down completely to avoid CPU burn-up.
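To get a feel for what throttling means for speed, here is a minimal sketch of the duty-cycle arithmetic. It assumes, as a simplification, that useful work scales linearly with the fraction of time the clock is running; the function names are hypothetical and the 2.2GHz figure is simply today's top Pentium 4 speed.

```python
DUTY_STEP = 0.125          # 12.5% increments, as described above
THROTTLE_PERIOD_US = 2.0   # approximate halt/restart cycle length


def effective_clock_mhz(nominal_mhz: float, duty_cycle: float) -> float:
    """Average clock rate delivered while throttling (linear approximation)."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    return nominal_mhz * duty_cycle


def running_time_us(duty_cycle: float, period_us: float = THROTTLE_PERIOD_US) -> float:
    """Microseconds per throttle period during which the clock actually runs."""
    return period_us * duty_cycle


if __name__ == "__main__":
    for step in range(1, 9):
        duty = step * DUTY_STEP
        mhz = effective_clock_mhz(2200, duty)
        print(f"duty {duty:6.1%}: about {mhz:6.1f} MHz effective")
```

At the default 50% duty cycle, a 2.2GHz part would on this reckoning deliver roughly the throughput of an 1100MHz chip for as long as the throttling lasts.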
Consequently the Pentium 4 has the most elaborate thermal protection of any CPU, but there also exists the possibility that throttling will be encountered during normal processing. There have been several reports of such P4 behavior under heavy load. We documented our experience with one Pentium 4 that apparently throttled in Quake III after intensive testing. Because this phenomenon was isolated to a single chip, we concluded that the specific CPU was defective.
Nevertheless, all Pentium 4s have the capacity to throttle, and chips that throttle under load may serve to inhibit Intel’s yields at high clock speeds. Intel couples its Pentium 4s with truly radical cooling solutions whose retaining mechanisms apply so much force onto the chip that the most delicate part of the motherboard is visibly deformed, which strongly suggests that the chipmaker is keenly concerned with P4 thermal issues.
Although moving to a smaller process demands lower voltages to be used, thereby allowing the chip to run cooler at any given speed, the heat that the chip does produce becomes more difficult to dissipate. 100 Watts pumped through an electric blanket will yield a surface that is only slightly warmer than room temperature as the surrounding air is sufficient to carry away the heat, but forcing 100 Watts through a processor die will cause an immediate meltdown unless a good heatsink is applied. As the Northwood ramps to higher clock speeds, heat will almost certainly serve as a ceiling as it does with current Athlon XPs.
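The electric blanket comparison comes down to power density, which a short sketch makes concrete. The blanket dimensions are an assumption chosen only for illustration, the die area is the Northwood figure quoted later in this article, and the 100 Watts is the round number used above rather than any chip's rated dissipation.

```python
def power_density_w_per_cm2(power_w: float, area_mm2: float) -> float:
    """Watts per square centimetre for a heat source of the given area."""
    area_cm2 = area_mm2 / 100.0   # 100 mm^2 in one cm^2
    return power_w / area_cm2


if __name__ == "__main__":
    POWER_W = 100.0

    # Assumption: a blanket of 1.5 m x 2 m, i.e. 1500 mm x 2000 mm.
    blanket_mm2 = 1500.0 * 2000.0
    die_mm2 = 146.0               # Northwood die area

    blanket = power_density_w_per_cm2(POWER_W, blanket_mm2)
    die = power_density_w_per_cm2(POWER_W, die_mm2)

    print(f"blanket: {blanket:.4f} W/cm^2")
    print(f"die:     {die:.1f} W/cm^2")
    print(f"the die is roughly {die / blanket:,.0f} times as dense")
```

The same power spread over a die instead of a blanket is concentrated by a factor in the tens of thousands, and a die shrink pushes that figure higher still for any given wattage.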
Voltage Transients Grow with Clock Speed
Finally, another problem for roadrunner, deeply pipelined chips is voltage transients. As clock speed increases, voltage transients become larger. When moving to the smaller geometries of Northwood, voltages are reduced, but relative voltage transients become even more severe. Intel has tried to reduce this issue by moving to Socket 478, which has a smaller package and more ground and power pins than Socket 423. However, a faster clocked processor is more likely to be stymied by signal noise than a chip running at a lower clock speed.
Fewer Pitfalls to AMD’s Approach
With the Athlon XP, AMD has a core that has more execution units than the P4 does and a pipeline that is only half as deep as Intel’s chip. Consequently, the Athlon XP does not need to reach high clock rates to produce equivalent performance levels. With a smaller, simpler, more efficient design the Athlon XP is not as susceptible to regionalized overheating, and while the Pentium 4 has by far the more sophisticated thermal regulation circuitry, arguably it needs it.
However, adopting Intel’s level of clock gating would likely help AMD’s chips attain higher speeds. More aggressive clock gating is likely included in AMD’s future designs. Interestingly, the Athlon XP’s much shallower pipeline does not appear to be the limiting factor to the chip’s clock speed ramping. Rather, heat is. Likewise, thermal issues will likely prevent the Northwood from reaching the full potential of its deep pipeline – unless voltage transients stop it first.
Meanwhile, the Athlon XP, by virtue of running at lower clock speeds, will have an inherent immunity to voltage transients over the Pentium 4.
Interest Wanes in Hyper-Pipelines
After an initial frenzy of interest in Intel’s Pentium 4 design, it appears that some of the CPU architects who shared that interest are beginning to recognize the shortcomings of so-called “hyper-pipelines.” Increased design complexity, greater die sizes, heightened power demands, greater susceptibility to thermal problems and the goblins of voltage transients all stand as significant barriers to wringing out enough clock cycles to make such dramatically deep pipelines worthwhile.
Thread Level Parallelism
Most modern microprocessors are able to exploit instruction level parallelism. What this means is that chips are able to execute non-dependent instructions at the same time to boost processing efficiency. Processing efficiency is commonly referred to as “instructions per clock cycle” or “IPC.”
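Since delivered performance is roughly IPC multiplied by clock speed, a small sketch shows how much of an IPC edge a slower-clocked chip needs in order to keep pace. The clock speeds are the ones quoted in this article (1667MHz for the Athlon XP 2000+ and 2.2GHz for the new Pentium 4); the IPC value in the usage lines is purely a placeholder, not a measurement of either chip.

```python
def instructions_per_second(ipc: float, clock_mhz: float) -> float:
    """Throughput as instructions per clock times clocks per second."""
    return ipc * clock_mhz * 1e6


def ipc_advantage_needed(slow_clock_mhz: float, fast_clock_mhz: float) -> float:
    """IPC ratio the slower-clocked chip needs to match the faster-clocked one."""
    return fast_clock_mhz / slow_clock_mhz


if __name__ == "__main__":
    needed = ipc_advantage_needed(1667, 2200)
    print(f"IPC advantage needed at 1667MHz vs 2200MHz: {needed:.2f}x")

    # Placeholder IPC of 1.0 for the faster-clocked chip (an assumption).
    fast = instructions_per_second(1.0, 2200)
    slow = instructions_per_second(1.0 * needed, 1667)
    print(f"matched throughput: {fast:.3e} vs {slow:.3e} instructions/s")
```

On these numbers the 1667MHz part breaks even if it averages about a third more work per clock than its 2.2GHz rival.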
The circuitry facilitating such scheduling trickery is complex. Proponents of VLIW (Very Long Instruction Word), like Transmeta, have argued for dumping these mechanisms and letting intelligent compilers organize code into batches that can be executed simultaneously.
Unfortunately for this group of people, it is very difficult if not impossible to offload this work to compilers and maintain performance levels equivalent to the best modern superscalar, out-of-order designs.
However, there is another side to this coin. Although greater parallelism will certainly lead to more performance, returns are beginning to diminish while complexity escalates.
In Hammer, the unspoken vision of AMD is to take its very good Athlon XP core, tweak it to run 64-bit code and improve processing efficiency, but to also keep it small and modular to enable eventual multiple-core-per-die processing.
Why do this? Because it allows the operating system to maximize processor usage by dispensing tasks at thread level to the different cores. In this way, it is like a compromise with the principles of VLIW in that software is effectively delegating parallel CPU resources.
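The benefit of dispensing threads to several cores can be sketched with a toy dispatcher. This is an illustration only and not a model of any real operating system's scheduler: it assumes fully independent threads, hypothetical run times measured in ticks, and a greedy policy that hands each thread to whichever core frees up first.

```python
import heapq


def makespan(thread_ticks: list[int], n_cores: int) -> int:
    """Ticks until all threads finish under greedy earliest-free-core dispatch."""
    if n_cores < 1:
        raise ValueError("need at least one core")
    core_free_at = [0] * n_cores      # time at which each core becomes idle
    heapq.heapify(core_free_at)
    for ticks in thread_ticks:
        earliest = heapq.heappop(core_free_at)   # first core to go idle
        heapq.heappush(core_free_at, earliest + ticks)
    return max(core_free_at)


if __name__ == "__main__":
    threads = [4, 3, 3, 2]            # hypothetical per-thread run times
    print(makespan(threads, 1))       # one core runs them back to back
    print(makespan(threads, 2))       # two cores share the load
```

With these four threads a second core halves the finishing time, but only because the threads are independent; a single long thread gains nothing from extra cores.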
Thoroughbred in Q1: AMD’s Die Size Advantage to Broaden
One of the most onerous aspects of the Willamette Pentium 4 is its huge die. At 217 mm², the Willamette is over twice as large as its predecessor, the Pentium III. In fact, Intel needs roughly three times as many wafers to produce equivalent numbers of Pentium 4s as it did to produce the prior generation of chips.
Clearly this poses massive logistical problems as the chipmaker transitions production to this newer product line, since moving to the Willamette Pentium 4 means a sudden tripling of demand for fabrication resources.
The Northwood, developed on a finer 0.13-micron copper process, relieves some of this demand by “shrinking” the Pentium 4. Even with its larger 512kB level 2 cache and 13 million more transistors, the Northwood is a comparatively svelte 146 mm².
However, the Northwood is still over 13% larger than the current Athlon XP. Worse still, AMD plans to begin shipping its own 0.13-micron chip, dubbed “Thoroughbred,” later this quarter. At only 80 mm², the Thoroughbred actually increases AMD’s existing die size advantages over the Willamette.
The current Athlon XP, also called “Palomino,” is 59% as large as Willamette, while Thoroughbred will only be about 55% as large as Northwood.
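The die size comparisons above can be checked with a few lines of arithmetic. Three of the areas are quoted in this article; the 128 mm² used for Palomino is an assumption consistent with the “59% as large as Willamette” figure. The dies-per-wafer estimate assumes a 200mm wafer and simply divides areas, so it ignores edge losses and yield and should be read as an upper bound.

```python
import math

DIE_AREA_MM2 = {
    "Willamette": 217.0,    # quoted in the article
    "Northwood": 146.0,     # quoted in the article
    "Thoroughbred": 80.0,   # quoted in the article
    "Palomino": 128.0,      # assumption, consistent with the 59% figure
}


def relative_size(chip: str, reference: str) -> float:
    """Die area of `chip` as a fraction of the die area of `reference`."""
    return DIE_AREA_MM2[chip] / DIE_AREA_MM2[reference]


def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 200.0) -> int:
    """Upper bound on dies per wafer: wafer area divided by die area."""
    wafer_area = math.pi * (wafer_diameter_mm / 2.0) ** 2
    return int(wafer_area // die_area_mm2)


if __name__ == "__main__":
    print(f"Palomino / Willamette:    {relative_size('Palomino', 'Willamette'):.0%}")
    print(f"Thoroughbred / Northwood: {relative_size('Thoroughbred', 'Northwood'):.0%}")
    print(f"Northwood / Palomino:     {relative_size('Northwood', 'Palomino'):.0%}")
    for chip, area in DIE_AREA_MM2.items():
        print(f"{chip:13s} at most {gross_dies_per_wafer(area)} dies per 200mm wafer")
```

Even as an upper bound, the estimate shows why die area matters so much: every die lost to a larger footprint is a chip that cannot be sold from that wafer.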
In fact, the Thoroughbred’s die size is so small that it is beginning to approach a condition known as “pad limited.” When a chip is pad limited, its die size is determined by the number of pads needed to connect the chip to its pins. Therefore, a pad limited chip is wasteful because some of its die will be empty.
Under such conditions, it is very compelling to throw more cache onto the die, or, better yet, install another core for on chip SMP (Symmetric Multi-Processing), as discussed above. AMD’s chips are evolving brilliantly towards this goal.
Intel, too, recognizes the benefits of exploiting thread level parallelism. However, at least in its initial steps towards achieving this in upcoming P4 Xeon chips, Intel is taking a much different approach. The Santa Clara chipmaker has developed an interface to its P4 that makes it appear to be two processors to the OS, but it is not. Instead, it distributes the workload from multiple threads across existing Pentium 4 execution units ensuring greater processing efficiency.
Intel claims up to a 40% gain in performance from Hyper-Threading. Although 40% will be the exception rather than the rule, Hyper-Threading is a laudable accomplishment. It is not without its shortcomings, however. As we have already mentioned, thermal constraints will likely prevent the Pentium 4 from reaching its deep pipeline’s potential.
Hyper-Threading might serve to largely negate Intel’s clock gating measures and significantly lower the chip’s thermally induced clock speed ceiling.
Needless to say, the Pentium 4’s large die size is less friendly to multi-core designs than the much smaller AMD designs.
A Painful Northwood Transition
And although the Santa Clara chipmaker would like investors to believe otherwise right now, the Northwood shrink does not immediately translate into lower costs and higher profits for Intel.
Typically, moving to a new process is difficult and initial yields are low. Not only must the processor’s architecture be tweaked, but the fab’s “recipe” for producing the chip will be completely different from its predecessor’s. Many process variables need to be refined through an iterative process to maximize yields and this can be a lengthy and frustrating affair.
This very issue has been demonstrated publicly in the last few months as both MPU-maker Transmeta and graphics controller giant NVIDIA have fought capricious leprechauns in migrating to TSMC’s 0.13-micron copper process. Both companies have seen repeated delays in their next generation products with fingers being pointed in all directions by every involved party.
Intel is not just moving to 0.13 micron, but is now adopting copper, a technology that AMD mastered years ago. Although the chip goliath has had its toes in the new process since the introduction of the PIII Tualatin, a major wrench was thrown in the works when crucial SVG 193-nanometer lithography tools, needed to retrofit existing Intel equipment for 0.13 micron, were a complete no-show.
Without the Silicon Valley Group’s tools, Intel has been forced to scramble for an alternative approach, namely hard phase shift masks. This technology is reportedly more expensive, and the unexpected direction change will likely extend process qualification times.
Bert McComas, Principal Analyst of Inquest Market Research, codified Intel’s problems when we spoke with him recently:
Has Intel put enough capacity in place using the old tools, or did they hold back waiting for the real thing? They may have only planned enough capacity in dual phase to satisfy Tualatin demand for servers and notebooks. If capacity is a problem, can they ramp up quickly with more dual phase capacity... Also, could we expect a shortage of Tualatins as Northwood ramps? This would translate into a boost for AMD and VIA if the market can anticipate and prepare to react by putting the right platforms in place for notebooks and servers.
Northwood was originally expected to launch last October, but Intel had to postpone the chip until today. Reported Northwood shortages appear to confirm suspicions that this delay might have fallen short of the chipmaker’s aim to provide enough time to sufficiently ramp up production. A quick survey of Price Watch, a shopping engine that specializes in computer parts, comes up dry for Northwood P4s, while the Athlon XP 2000+ is available from about a dozen vendors. [ed: five vendors are now listed on Price Watch carrying the 2.2GHz part.]
2.2GHz Pentium 4 parts have also not yet made their way to Intel’s favorite-son OEM vendor, Dell Computer. The Round Rock, Texas-based computer maker usually has first dibs on new Intel chips. [ed: A system has since turned up on Dell's site.]
Circumstantial evidence strongly suggests that Intel has not yet solved its production problems involving high-clocked P4s. This makes today’s debut of the 2.2GHz P4 a somewhat hollow affair.
While AMD has fired a bona fide cannonball at Intel’s noggin in the Athlon XP 2000+, it appears that Intel might be shooting back blanks with the 2.2GHz Pentium 4.
Please view Part 2 of this review for numerous benchmark results.