Bits & Bytes: The P4 is Dead

By Van Smith

Date: May 13, 2004

Let's hope that those Intel engineers who gave the thumbs up to Prescott don't decide to take up bridge building, aircraft construction, and nuclear power plant design.  If they do, then we'll all need to walk around in radiation-proof-scuba-equipped-skydiving outfits.

As it stands now, these silicon-wielding simulacra have managed to craft the biggest catastrophe in the history of computing.  They've completely wrecked Intel's bread-and-butter desktop business.  They've killed the P4.


The Pentium 4 is Dead

Late last week Intel announced that Tejas, the next Pentium 4 in line after Prescott, has been canceled because of thermal issues.  We exclusively disclosed the existence of Tejas two years ago.  Instead of Tejas, the world's largest chipmaker will concentrate its efforts on bringing dual-cored Pentium-M technology to the desktop sometime in 2005.

Faithful readers of these pages will find these developments awfully familiar.  We've been agonizingly enunciating the flaws with Intel's Pentium 4 architecture ever since Willamette's introduction in 2000, and the very issues we raised were those that led to the high-profile processor's ultimate demise.  Moreover, the course of action that we have been recommending for Intel is the direction the chipmaker is finally following.  (For background information, good places to start are here, here and here.)

In a nutshell, hyper-pipelining is crazy.  While some in Intel have sung panegyric love songs about sixty-stage pipelines sweeping us tenderly to 10GHz in a few short years, these pundits must have plugged their ears to the head-splitting racket that thermal density and signal integrity were raising.

Deep pipelines require high clock speeds to remain performance competitive with shallower, more processing-efficient pipeline designs.  But higher frequencies demand more power and faster transistors.  Yet fast transistors are leaky transistors, and they get leakier with newer, smaller process technologies.  So deep pipelines demand relatively more and more power as CPUs migrate to new process technologies.

And higher frequencies magnify voltage transients, so Vcc has to be bumped up to ensure signal integrity.  But, worse yet, power requirements rise with the square of voltage, so…

…you end up with Prescott: a new CPU on a smaller process with a ridiculously deep pipeline that makes it slower than its predecessor, Northwood, at the same clock speed (despite having twice the L2 cache - quite a feat in itself).  Yet Prescott demands more power than Northwood and has to dissipate its greater heat output through a smaller die, making thermal density rise egregiously.

Consequently, Prescott's >50% deeper pipeline is all for naught because thermal (and probably signal quality) issues gate gigahertz.  The very sad thing for Intel is that, barring process/circuit design miracles, Prescott won't be able to reach much higher frequencies than Northwood.

Speaking of the process and circuit design folks, it's no secret that Intel has some of the very best in the world.  Ever since Willamette, the heat has been on them to save Intel's bacon.  You can bet that few tears will be shed among them over the demise of the P4.  With each process shrink, their jobs became ever more impossible.

Although Prescott was no doubt a marketing-driven monstrosity, at least one engineer in a position of authority who should have known better made the boneheaded decision to give it a green light.  Crazy, just downright crazy.


Moving Out to Move Up

The future of desktop computing is the production and refinement of small, out-of-order, superscalar processing cores, with performance scaled upwards by placing several of these cores on a die.  Multiple cores per die are easy to implement, guarantee well-understood performance gains, and pose no surprises in the thermal density or signal integrity arenas.

Processors today are rapidly approaching a limit to practical computational efficiency.  In fact, the AMD Athlon 64 is very near that asymptote where increasing design complexity is not worth the meager performance gains.

It is worth noting that, while Intel's Pentium-M and VIA's processors are following AMD toward this asymptote, the Pentium 4 was headed in the opposite direction: as the thermal ceiling slips downwards with each process shrink, normalizing clock speeds among CPU design families, the P4's overall performance was getting shoved into the dirt.

Besides the addition of on-die memory controllers, in the near future the only practical areas still open for improving processing efficiency are the addition of more/wider SIMD/vector units and hardware support for specific functionality, like VIA's AES implementation in its C5P line of chips.  Eventually, graphics core functionality will also be moved on-die, where arrays of vector units can be dynamically allocated for either general-purpose processing or graphics rendering.

Small, efficient, modular cores have the added benefit of allowing a processor vendor to produce a highly scalable product line targeting everything from handheld gaming devices to behemoth NSA supercomputers.

While it is true that few of today's major software applications will see any benefit from multi-cored CPUs, you can bet your bottom dollar that this is going to change rapidly over the next few years for the simple reason that adding multiple threads to applications is the only direction to go to get significantly greater performance from emerging processor designs.  And besides, multithreaded applications are relatively easy to write with modern programming tools.


In Through the Out Door

We've always referred to Intel's Pentium-M, the chipmaker's current line of mobile processors derived from the Pentium III, as the design the Pentium 4 should have been.  Perhaps Intel was listening to us (well, we know that they were reading us) because now the PM is filling the void left by the P4's collapse.  Intel intends to drive multi-core Pentium-M designs to the desktop next year.  Bully for them, because this is the spot-on right move.

Up until now, Intel has focused its Banias/Dothan (130nm and 90nm Pentium-M, respectively) design efforts on optimizing for power efficiency in order to maximize battery life in mobile applications.  Although Dothan's leakage power has gone up a lot from Banias's, for the most part Pentium-M is a very strong mobile line - and certainly much, much better for mobile products than Pentium 4 products ever were.

Because of this focus on low power consumption, the Pentium-M's floating point unit, one of the most power-hungry parts of the CPU, has been neglected.  It's still a good FPU, but compared with the unit outfitted to the Athlon 64 (or even the Athlon XP - its FPU is almost identical to its 64-bit son's), it comes up wanting.

With little additional design effort, Intel should be able to produce desktop Pentium-M derivatives that match the clock speeds of AMD's parts.  Furthermore, bumping the front side bus up to 1GHz should be no sweat.  While this is not as good as integrating the DRAM controller on-board a la the Athlon 64, these two measures alone will go a long way toward providing performance-competitive parts.  On top of this, Intel will likely enjoy a significant die size advantage.

However, as long as AMD's 90nm shrink of the Athlon 64 goes as swimmingly as the company publicly maintains, they should enjoy a growing performance lead over Intel for about two years.  After that, Intel will likely have refined the old PM-PIII core enough so that it will be able to stand toe-to-toe with AMD's cores at any given clock speed.

Of course, Intel would have already been there by now if it hadn't wasted so much time, money and effort on the P4.

Now is AMD's time to execute on the production side and act swiftly and decisively to grab market share that is ripe for the plucking.  However, if AMD squanders its lead, Intel could come back and squash them like a bug.

Make no mistake, Intel has the resources to whip the Pentium-M into the same performance league as the Athlon 64, but it will take about two years to do so.  After that, AMD, Intel and even VIA will have small, modular, out-of-order, superscalar x86 cores that all perform roughly in the same ballpark. 

Beyond this convergence point awaits new territory and new business challenges as x86 cores will swiftly become commodity items.  At this time Intel's fabrication strengths can play to its advantage, but only if the chip giant is willing to suffer much lower margins than it has historically enjoyed.

Only when competitive 3D graphics processing becomes integrated on-die will sufficient product differentiation exist to, at least temporarily, drive profit margins back up.

VIA, too, can capitalize on Intel's recent missteps.  Dothan's leakage power is quite high compared with Banias's or VIA Antaur's.  Moreover, VIA plans to have a 2GHz part of its own late this year, built with IBM's best 90nm SOI technology.  While Antaur will not come very close to Dothan's overall performance, VIA's SSE2-enhanced, security-laden chips will offer very, very good performance per Watt.  The 90nm Antaur will also be an inexpensive drop-in substitute for economical thin-and-light notebooks leveraging the Pentium-M chipset infrastructure.

But to get back to Intel: as drastic as it was for the Santa Clara, California-based company to kill off its bread-and-butter line of desktop processors, it was clearly the right move.  Although the chip peddler will suffer in the performance stakes for the next couple of years, finally promoting its Pentium-M line over its swiftly crashing NetBurst architecture is the best course of action for Intel to take.  The chip maker is unquestionably in a much stronger position now, though near-term market share might not always reflect this.

Although the Pentium 4 (especially Prescott) was one of the biggest missteps in computing history, we might all need to be grateful for it.  If Intel had followed AMD's lead instead of adopting the marketing driven P4, Intel's process technology advantages might have allowed it to create a chip that out-Athloned the Athlon, tipping the talented but fragile rival into insolvency.


The Security Wildcard

Intel and AMD appear to be committed to the so-called "Trusted Computing" road.  If these measures are as egregiously invasive as many fear them to be (try here, here, here, and here for starters), consumer backlash could drive people stampeding to VIA's chips, which take a completely antithetical approach to security.

VIA's CPUs provide extremely powerful security functions that allow the end user to quickly, safely and easily encrypt/decrypt anything: wireless data transmissions, email, voice communication, individual files or even complete file systems.  These security functions, which will be expanded significantly in the 90nm Antaur, are exposed as simple x86 instructions that often impose little, if any, hit on concurrent processing.

Recently, VIA audaciously released a hardware accelerated WASTE-based peer-to-peer encrypted file sharing client that also provides encrypted chat features.  What makes this move so stunning is that the company released this Open Source program at a time when the FBI is trying to quietly push through laws requiring snooping backdoors on all chat and voice-over-IP clients!  Not only does this demonstrate the company's commitment to personal liberty and privacy, but it shows that VIA has a lot of guts as well.

VIA's security strategy is not without risks, however.  By taking a direction 180 degrees opposite of Trusted Computing, the company could be left out in the cold if laws are passed mandating the use of Trusted Computing compliant computers to access the Internet.  As insane as such laws may sound, there has already been a lot of work done in this direction and, though Sen. Fritz Hollings' bill failed to pass, the effort is slowly coming to a head.

The best outcome for personal privacy and liberty would be for all parties to follow VIA's lead.


The Creature Maker Visits Austin

Speaking of liberty, an author popular with freedom-loving Americans visited Austin to give a lecture on taking back our country.  Speaking to a standing-room-only crowd, G. Edward Griffin captured the rapt attention of a group of Americans concerned with the Patriot Act, the erosion of the Bill of Rights, biometrics, chip implants, corporate media propaganda, computerized voting fraud, the increasing acceptance of torture, the declining diversity between major-party political candidates, orchestrated wars, the imposition of a world government, gun control, our country's migration towards becoming a cashless society, and other trends in our nation away from the concepts of personal liberty and independence that made our country unique and strong.
G. Edward Griffin
You can hear a portion of Mr. Griffin's lecture here.

Mr. Griffin's most famous work is the riveting treatise on the Federal Reserve entitled The Creature from Jekyll Island.


Didn't We Kill this Guy Already?

The grisly execution of American Nick Berg has been pinned on Jordanian "al Qaeda leader" Abu Musab al-Zarqawi, yet, according to reports released back in March, the U.S. killed the wooden-legged al-Zarqawi in bombing attacks.  It's certainly true that the reports of al-Zarqawi's death were unverified, but how did the unfortunate Mr. Berg end up in his killer's hands so quickly after he was released from apparently unwarranted coalition custody?


Transmeta's Achilles' Heel

If you've ever used a Transmeta notebook, you have no doubt noticed how sluggish Transmeta systems can be in everyday operation.  The reason for this is simple: severe translation latency penalties are paid under many, many conditions.

Transmeta processors are not native x86 CPUs, but use an emulator to translate x86 instruction streams into native Transmeta machine instructions.

First we'll look at the number of clock cycles necessary to multiply a 2,000-element, double-precision (floating point) array by an arbitrary value.  GetCycleCount is an assembly language wrapper for RDTSC.  The overhead of making two calls to GetCycleCount is included in all data presented here (direct inlining yielded the same amount of overhead).  The code for the simple test follows.

function TfrmMathTest.FPUMulArray: int64;
var
  lStart, lStop : TRDTSCTimeStamp;
  i : integer;
begin
  if cboxRandomize.Checked then RandomizeValues;
  RandomizeArray( FPArray );
  // time the scaling loop with back-to-back RDTSC reads
  lStart.int64 := GetCycleCount;
  for i := low( FPArray ) to high( FPArray ) do begin
    FPArray[ i ] := FPArray[ i ] * gamma;
  end; // for
  lStop.int64 := GetCycleCount;
  result := lStop.int64 - lStart.int64;
end;

Looking at the initial run, you can see that the Efficeon is staggeringly poor on this test.  The translation penalty that has to be paid each time a new routine is run can be tremendous.  In the chart below, we graph the number of CPU clock cycles required to run the test discussed immediately above.  Fewer clock cycles (shorter bars) are better.

However, if you execute the same routine many times, Transmeta processors sometimes cache translated code so that subsequent runs do not suffer translation penalties.  Under certain circumstances the Efficeon is able to cache the simple test above, and if it is run again, performance becomes much better.

The Efficeon does not cache all translations, though.  For instance, no matter how many times we reran short, non-looping tests, we always experienced translation penalties.  The chart below illustrates an example.

Translation penalties explain why Transmeta-based computers are so very sluggish when working through everyday applications.  Because most benchmarks loop over the same routines many times, translation penalties are diluted.  This is why most benchmarks do not reflect real world usage experience on Transmeta systems.


New Notebook

I am a compulsive gadget geek.  Fortunately for me, Kathy has a soft heart and tolerates my binges.  Last week I bought a new notebook and it is a brawny beast, especially for the $1,350 (with two years no interest) that we paid.  It is an eMachines M6809.  Below are the specs, copied directly from eMachines' web site.

Display: 15.4" Widescreen TFT LCD WXGA (1280 x 800 max. resolution)
Operating System: Microsoft® Windows® XP Home Edition
CPU: Mobile AMD Athlon™ 64 3200+ Processor
  64-bit architecture operating at 2.00 GHz
  System bus uses HyperTransport™ technology operating at 1600 MHz
  1 MB L2 Cache
Memory: 512 MB DDR SODIMM (PC 2700)
Hard Drive: 80 GB HDD
Optical Drives: DVD +/- RW Drive (Write Max: 2.4x DVD+R/RW, 2x DVD-R/RW, 6x CD-R, 10x CD-RW; Reads 24x CD, 8x DVD)
Media Reader: 6-in-1 Digital Media Manager (Compact Flash, Micro Drive, MultiMedia Card, Secure Digital (SD), Memory Stick, Memory Stick Pro)
Video: ATI® Mobility RADEON™ 9600 with 64 MB Video RAM
Sound: PC2001 Compliant AC '97 Audio
Built-in Stereo Speakers
Modem: 56K* ITU V.92 Fax/Modem
Network: 802.11g Built-in Wireless (up to 54Mbps), 10/100Mbps built-in Ethernet
Pointing Device: Touchpad with Vertical Scroll Zone
Battery: 8-cell Lithium-ion (Li-ion)
Dimensions: 1.6"h x 14.0"w x 10.4"d  Weight: 7.5 lbs. (8.65 lbs. total travel weight)
Internet: AOL 3 month membership included, click here for details
Ports/Other: 4 USB 2.0 ports, 1 IEEE 1394, 1 VGA External Connector, 1 S-Video Out, Microphone In, Headphone/Audio Out, 1 PCMCIA Slot (Card Bus type I or type II)
Pre-Installed Software: Microsoft Works 7.0, Microsoft Money 2004, Encarta Online, Adobe® Acrobat® Reader™, Microsoft Media Player, Real Player, PowerDVD, Internet Explorer, Roxio Easy CD & DVD Creator (DVD Edition), BigFix®, MSN®, CompuServe®, AOL (with 3 months membership included**), Norton AntiVirus 2004 (90 day complimentary subscription)

Yeah, it's real fast.  If I get some free time, I might write up a review.  For the time being, I've been very pleased.


Copyright 2004, Van Smith