Posted By Van Smith
Date: July 31, 2001
A recent VHJ analysis using a popular benchmark has led its developer to seek further assistance from Intel and to revisit the bandwidth test code in an attempt to provide additional Pentium 4 optimizations.
SiSoftware's CTO, Adrian Silasi, told VHJ that his company plans to attempt to further enhance the Pentium 4's bandwidth performance in Sandra after VHJ tests showed DDR SDRAM delivering much higher bandwidth efficiency than RDRAM.
Although the Pentium 4's code is nearly a year more mature than the analogous Athlon routines and even though Intel "optimized the initial SSE2 STREAM benchmarks," Mr. Silasi maintains that he has now "asked Intel to optimize the P4 benchmarks properly" because "the SSE tests take a big hit on P4 due to the weak FPU (e.g. even "exchange" instructions don't pair up) thus unless you use SSE2 you have a problem."
Additionally, the Sandra beta 8.15 used in our tests contained altered AMD code that undermined Athlon-DDR SDRAM performance. Of these routines, Mr. Silasi said he "re-optimized them for the PIII." This "killed Athlon performance in the FPU test," Mr. Silasi revealed.
However, in the latest Sandra beta (v8.21) Mr. Silasi states SiSoftware "reverted back to their [AMD's] code," but "I have also written SSE STREAM benchmarks for the P3 based on SGI's article. These are not used yet, but may be (for P3) if AMD and I don't agree on their benchmark optimizations. I have sent these to Intel for further optimizations."
In a promising turn, Mr. Silasi concluded, "I have asked both [Intel and AMD] to allow the release of the source code. This way you shall be able to see/show it to "experts" to confirm that there is no 'monkey business' going on."
FPU <> SSE2
After the initial release of Pentium 4 Sandra optimizations immediately prior to the chip's launch, I was involved in intense debates with SiSoftware regarding their decision to "SSE2 optimize" P4 FPU results and report the numbers as if derived via the P4's true FPU. "You've touched an issue which we discussed quite a bit here and with Intel," Mr. Silasi stated. "The [Pentium 4's] FPU seems indeed 'weak'; not surprisingly SSE2 is really needed to bring performance to usual levels," he continued.
Here is a note I sent in response:
Thanks for the quick reply.
Going into options and disabling SSE2 optimizations does not impact the comparison results for the P4, and this, I believe, is specious given that this performance will not be realized unless an application is specifically optimized for SSE2. I am sure that Intel would argue strongly against this [to report true FPU performance instead of SSE2 derived numbers], but my point is nonetheless valid. If you show the current P4 SSE2 Whetstone results in your FPU comparisons, then you are essentially serving as an advertising tool for Intel because your results do not reflect why a user's current FPU intensive applications run slowly on the P4. Upgrading software to take advantage of SSE2 can be an expensive hidden cost for people looking at Sandra's results and deciding to take the leap to the P4.
I think the question "what would be the gain in using SSE2" should be kept separate from default FPU performance, since this gain will not be realized for any existing applications. If the SSE2 result is not broken out into a different bar or benchmark, then I suggest turning it off by default, that way the comparison results would show actual FPU performance. Again, you have a different section for SSE, SSE2, 3dNow! and I think it is good to keep it that way.
I know the SSE streaming commands would essentially bring bandwidth results in line with theoretical hardware limits if coded properly. 3dNow! for the Athlon also introduced streaming commands, but you do not seem to be using them. Why not?
Thanks. Determining the trustworthiness of our benchmarking tools is a number one priority.
In a later note I wrote:
I am not discounting the importance of the SSE2 obtained Whetstone results; I simply believe that at this point in time the true FPU results are more important, especially under the guise of "FPU" as the results are labeled in your tests. It would be nice to see both sets of (reference) results...
After a few more sometimes heated exchanges, Mr. Silasi agreed to the compromise contained in current versions of his successful program. Sandra's Pentium 4 FPU results are shown as a bi-colored bar with the first part of the bar displaying true FPU performance with the remainder of the bar reporting SSE2 results. This is illustrated in the picture below.
New Athlon-DDR SDRAM Bandwidth Results
We tested a 1.4GHz Athlon with an Epox 8KHA DDR SDRAM motherboard using 256MB of CL2, 4-way interleaved memory running the version of Sandra that Mr. Silasi reports as having unadulterated AMD streaming code. The integer bandwidth results are shown below:
The bandwidth for DDR SDRAM has improved significantly. We are using a different chipset, which may account for part of the gain; the difference in CPU speeds should account for little, if any, of the improvement. We will provide more comprehensive results soon in a related article.
Using AMD's code, the recalculated bandwidth efficiency graph below shows DDR SDRAM extending its lead over RDRAM.
For the Pentium 4-RDRAM platform to catch up, the new Intel/SiSoftware code will have to deliver about 2.7GB/s. This is about 700MB/s more than the current iteration of Sandra's P4-optimized bandwidth tests delivers.
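For readers who want to check the arithmetic, here is a minimal sketch of how bandwidth efficiency (measured bandwidth divided by theoretical peak) can be worked out. The 2.7GB/s target and the roughly 2.0GB/s current figure come from the numbers above. The memory configuration is an assumption on our part for illustration: dual-channel PC800 RDRAM at 800 million transfers per second, 2 bytes per transfer, for a 3,200MB/s peak. The function names are hypothetical and are not taken from Sandra.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bytes_per_transfer, channels=1):
    """Theoretical peak bandwidth in MB/s for a memory subsystem."""
    return mega_transfers_per_s * bytes_per_transfer * channels


def efficiency(measured_mb_s, peak_mb_s):
    """Fraction of the theoretical peak that a benchmark actually achieves."""
    return measured_mb_s / peak_mb_s


# ASSUMPTION (not stated in the article): dual-channel PC800 RDRAM,
# 800 MT/s x 2 bytes x 2 channels.
rdram_peak = peak_bandwidth_mb_s(800, 2, channels=2)

# Figures from the article: about 2.7GB/s needed, about 700MB/s above current.
target_mb_s = 2700
current_mb_s = target_mb_s - 700

print(f"Assumed RDRAM peak:  {rdram_peak} MB/s")
print(f"Current efficiency:  {efficiency(current_mb_s, rdram_peak):.1%}")
print(f"Required efficiency: {efficiency(target_mb_s, rdram_peak):.1%}")
```

Under that assumed peak, the current P4 code would be reaching a little under two-thirds of the theoretical limit, and the new code would need to reach roughly 84 percent to draw level. A different memory configuration would change both percentages.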