Q&A: How Important is Bandwidth to Pipeline Depth?

By Van Smith

Date: February 27, 2002

Hi Van,

I'm perhaps just up too late, but I had a question about CPU design and my EE friend wasn't sure of an answer. If you design an x86 CPU with a 10-stage pipeline, and then "hyperpipeline" it to 20 stages, would the need for memory bandwidth scale linearly or geometrically?

I was wondering because a few sites are claiming that a 533MHz FSB will suddenly and substantially improve P4 performance. If it scaled linearly, I'd have to think that you'd need 4x more bandwidth for a 2.0GHz P4 than a 1GHz PIII to remain about as efficient.

But at the same time, PC2100 performs about the same as Rambus, unless that is because of its greater overall efficiency (and effectively the same bandwidth). Any ideas?



For a given processor architecture, bandwidth should scale linearly with FSB/memory bus speed. The need for additional bandwidth only arises if the processor is fast enough to handle the data stream. The PIII is more than robust enough (i.e., its IPC and cache subsystems are fast enough) to absorb the data stream that the P4 enjoys, but Intel is deliberately inhibiting FSB speeds so that the PIII does not encroach any further on the P4's performance range (Intel was at one time considering a 200MHz FSB PIII).
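
To put rough numbers on that linear scaling, here is a minimal Python sketch of theoretical peak bus bandwidth. It assumes a 64-bit (8-byte) data path for every bus listed and uses the nominal effective transfer rates; the function and variable names are my own, chosen for illustration, and real sustained throughput will be lower than these peaks.

```python
# Theoretical peak bandwidth = effective transfer rate x bus width.
# Assumption: every bus below has a 64-bit (8-byte) data path.
BUS_WIDTH_BYTES = 8


def peak_bandwidth_gb_s(effective_mhz, bus_width_bytes=BUS_WIDTH_BYTES):
    """Return theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return effective_mhz * 1e6 * bus_width_bytes / 1e9


# Nominal effective FSB rates in MHz (millions of transfers per second).
buses = {
    "PIII, 133MHz FSB": 133,
    "Athlon, 266MHz FSB": 266,
    "P4, 400MHz FSB": 400,
    "P4, 533MHz FSB": 533,
}

for name, rate in buses.items():
    print(f"{name}: {peak_bandwidth_gb_s(rate):.2f} GB/s peak")
```

Doubling the effective transfer rate doubles the peak figure, which is all that "linear scaling" means here. Whether the core can actually consume that stream is a separate question.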

One positive thing the P4’s deep pipeline brings is that the ability to handle additional bandwidth (should the FSB be opened up) scales linearly with clockspeed, and the 20-stage pipeline will allow the core's clockspeed to ramp higher.

However, the 20-stage pipeline imposes a severe penalty on branch mispredictions. I think when you mention the “need for memory bandwidth” at higher clockspeeds with deeper pipelines, you are considering the tradeoff between greater bandwidth and the increasing penalties from pipeline stalls. There is really no hard and fast rule to state here, since some applications demand high bandwidth and have few opportunities for branch mispredictions, while others, like most office applications, incur far more pipeline stalls.
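
The tradeoff can be sketched with a simple cycles-per-instruction (CPI) model in which each mispredicted branch costs a full pipeline flush. Every number below (base CPI, branch fraction, misprediction rates, clockspeeds, and the assumption that the flush penalty equals the pipeline depth) is a hypothetical value chosen for illustration, not a measurement of any real processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction, including misprediction flushes."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


def instructions_per_ns(clock_ghz, cpi):
    """Throughput in instructions per nanosecond (clock in GHz / CPI)."""
    return clock_ghz / cpi


BASE_CPI = 1.0         # hypothetical CPI with perfect branch prediction
BRANCH_FRACTION = 0.2  # hypothetical: one instruction in five is a branch

# Hypothetical designs: (clockspeed in GHz, flush penalty in cycles).
designs = {
    "10-stage @ 1.0GHz": (1.0, 10),
    "20-stage @ 1.5GHz": (1.5, 20),
}

# Hypothetical misprediction rates for two kinds of workload.
workloads = {
    "streaming, predictable branches": 0.01,
    "branchy, office-style code": 0.10,
}

for workload, miss_rate in workloads.items():
    print(workload)
    for design, (clock_ghz, penalty) in designs.items():
        cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, miss_rate, penalty)
        rate = instructions_per_ns(clock_ghz, cpi)
        print(f"  {design}: CPI {cpi:.2f}, {rate:.2f} instructions/ns")
```

Under these assumed inputs the deeper pipeline keeps most of its clockspeed advantage on predictable code and gives back a good part of it on branch-heavy code, which is why no single rule covers every application.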

In general, the P4 core is much less robust than the PIII’s, and at any given clockspeed the PIII (especially the Tualatin) will outperform the P4 by large margins unless bandwidth or SSE2 is an issue. The P4’s design compromises were made in order to prevent the already mammoth P4 die size from getting much larger; implementing the 20-stage pipeline takes lots of real estate.

The P4 is a very bad design. AMD could be killing the P4 in performance right now, if it wanted to, by enabling 400MHz FSB speeds for the Athlon (although the Athlon does very well at 266MHz FSB). The Athlon’s bandwidth would then be much higher than the P4’s, taking away one of the P4’s very few advantages. I suspect that AMD is holding back the Athlon in order to make the Hammer look even better at its introduction.



