Boosting the VIA C3's Floating Point Performance
By Van Smith
Date: March 31, 2004
VIA's C3 is an excellent CPU for low power x86 applications, but its integrated floating point unit is far from muscular. However, with a very minor software optimization trick, the performance of the C3's FPU can be increased by 150% (2.5x) or more!
Under normal conditions, many compilers program the CPU's floating point unit to handle error conditions by setting the corresponding bits in the floating point status word and otherwise processing the errors internally under narrowly defined rules. This mode of operation is enabled by hiding or "masking" from the outside world the handful of different error cases that might be encountered. Masking all FPU exception conditions is the default behavior under the current IEEE standard. Programmers sometimes choose to unmask floating point exceptions while debugging or for special error tracking features, even when working with compilers that normally disable FPU exception trapping.
However, a number of compilers always unmask FPU exceptions in order to enable the compiler's proprietary error handling mechanisms. When a floating point exception is unmasked, the corresponding error condition triggers an exception that the programmer must handle in code. If the programmer fails to take such precautions, the exception percolates to the operating system, which displays an error message and then shuts down the offending program.
All Borland compilers automatically create binaries that unmask several floating point exceptions upon program initialization. Delphi 7 unmasks FPU invalid operations, divide by zero, and overflow conditions. The specific Delphi 7 FPU initialization assembly routine is:
fldcw word ptr [Default8087CW]
The routine's first command initializes the floating point unit to a known condition that masks all exceptions, but its third line, the fldcw instruction, loads a predefined 16-bit variable into the FPU's control register, which then unmasks the exceptions listed above. The value of Default8087CW is set to $1372 in Delphi 7, whereas the default state of the floating point control register after initialization is $037F.
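The difference between the two control word values is easiest to see by decoding their low six bits, which are the exception mask bits in the standard x87 control word layout (a set bit means the exception is masked). The console program below is a minimal sketch of that decoding; the names MaskNames and ShowMasks are illustrative and do not come from Delphi's runtime library.

```pascal
program DecodeControlWord;
{$APPTYPE CONSOLE}

const
  // x87 control word mask bits 0..5, in bit order.
  MaskNames: array[0..5] of string = (
    'invalid operation', 'denormal operand', 'divide by zero',
    'overflow', 'underflow', 'precision');

// Print, for each of the six exceptions, whether CW masks it.
procedure ShowMasks(CW: Word);
var
  Bit: Integer;
begin
  for Bit := 0 to 5 do
    if (CW and (1 shl Bit)) <> 0 then
      Writeln('  ', MaskNames[Bit], ': masked')
    else
      Writeln('  ', MaskNames[Bit], ': unmasked');
end;

begin
  Writeln('$1372 (Delphi 7 value):');
  ShowMasks($1372);
  Writeln('$037F (state after FPU initialization):');
  ShowMasks($037F);
end.
```

For $1372 the low six bits are 110010, so invalid operation, divide by zero, and overflow come out unmasked, matching the list above. For $037F all six bits are set, so every exception is masked.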
The CPUs in VIA's C3 line all currently feature scalar pipelines that were designed for the IEEE-standard behavior, and they suffer performance penalties when any FPU exceptions are unmasked. Fortunately, it is very easy and usually safe to mask all FPU exceptions. In fact, it is often more problematic to leave FPU exceptions unmasked, especially when interfacing with DLLs built under different compilers. For instance, FPU exceptions must be masked when creating an OpenGL program, or the program could crash when returning from an OpenGL routine.
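For programs that want to keep Delphi's unmasked exceptions in their own code, one option is to mask them only around calls into foreign DLLs. The following is a minimal sketch under stated assumptions: it relies on Get8087CW and Set8087CW from the System unit (Get8087CW assumes Delphi 6 or later), and TForeignProc, ClearFPUFlags, and CallWithMaskedFPU are hypothetical names introduced for illustration.

```pascal
type
  // Hypothetical signature standing in for any routine imported from a DLL.
  TForeignProc = procedure; stdcall;

// Clear the sticky exception flags in the FPU status word.
procedure ClearFPUFlags;
asm
  fnclex
end;

// Run Proc with all six FPU exceptions masked, then restore the caller's masks.
procedure CallWithMaskedFPU(Proc: TForeignProc);
var
  SavedCW: Word;
begin
  SavedCW := Get8087CW;          // remember the caller's control word
  Set8087CW(SavedCW or $003F);   // set bits 0..5: mask all six exceptions
  try
    Proc;                        // foreign code runs with exceptions masked
  finally
    ClearFPUFlags;               // drop flags the foreign code left behind
    Set8087CW(SavedCW);          // restore the original masks
  end;
end;
```

On a C3 this approach still pays the penalty everywhere outside the wrapped call, so masking once at startup remains the faster choice when the program does not depend on floating point exceptions.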
All that is necessary to mask FPU exceptions is to initialize the floating point unit once prior to entering the code block that exercises the FPU. In Delphi, a simple routine to do this follows:
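A minimal sketch of such a routine, assuming 32-bit Delphi with inline assembly, is shown here. The procedure name InitFPU and the form class TForm1 are illustrative; the only essential part is the finit instruction, which returns the control word to $037F.

```pascal
// Hypothetical helper name. Resets the x87 FPU to its initialized state,
// which leaves the control word at $037F with all six exceptions masked.
procedure InitFPU;
asm
  finit   // reset the control, status, and tag words
  wait    // let the FPU finish before continuing
end;

// Example call site: run it once when the form is created.
procedure TForm1.FormCreate(Sender: TObject);
begin
  InitFPU;
end;
```

Because finit also empties the FPU register stack, the call belongs at a point where no floating point calculation is in progress, such as form or object creation.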
Calling this procedure upon object creation greatly enhances floating point performance for the VIA C3 and is either beneficial or benign for all other CPUs. If floating point errors are expected within your code and cannot be "debugged away," error conditions can be checked for by polling the floating point status word.
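Polling the status word can be sketched as follows. This is an illustration, not code from Delphi's runtime library: the helper names and the SW_ constants are assumptions, while the bit positions are those of the standard x87 status word (bit 0 invalid operation, bit 2 divide by zero, bit 3 overflow). It assumes 32-bit Delphi, where a Word function result is returned in AX.

```pascal
program PollStatusWord;
{$APPTYPE CONSOLE}

const
  SW_INVALID  = $0001;  // status word bit 0: invalid operation
  SW_ZERODIV  = $0004;  // status word bit 2: divide by zero
  SW_OVERFLOW = $0008;  // status word bit 3: overflow

// Mask all FPU exceptions by reinitializing the FPU.
procedure InitFPU;
asm
  finit
  wait
end;

// Return the FPU status word; fnstsw stores it directly in AX.
function GetFPUStatusWord: Word;
asm
  fnstsw ax
end;

// Clear the sticky exception flags before a calculation.
procedure ClearFPUFlags;
asm
  fnclex
end;

var
  Numerator, Denominator, Quotient: Double;
begin
  InitFPU;
  Numerator := 1.0;
  Denominator := 0.0;
  ClearFPUFlags;
  Quotient := Numerator / Denominator;  // masked: no exception is raised
  if (GetFPUStatusWord and SW_ZERODIV) <> 0 then
    Writeln('divide by zero detected')
  else
    Writeln('quotient = ', Quotient:0:4);
end.
```

The exception flags are sticky, so a single check after a whole block of calculations is enough to learn whether any error occurred inside it, provided the flags were cleared beforehand.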
Calling the simple routine above has a profound impact on floating point throughput for VIA's C3 processors. In our soon-to-be-released COSBI OSMark, decoding compressed photographs progresses much faster:
FINIT resulted in a 53% performance improvement on JPG picture decoding
The effect was even more pronounced on the FPU-intensive "n-body" test that simulates the motion of four bodies bound together by a gravity-like force:
The n-Body test enjoys a 152% boost when the FPU is properly initialized.
VIA's C3 processors have been successful in the ultra-small form factor space, where the tiny processor has been embraced by everyone from hobbyists and robot designers to silent PC advocates and a wide variety of specialized x86 embedded product suppliers. However, the biggest weakness of the "cool running" C3 continues to be its floating point performance.
Properly initializing the C3's FPU by calling finit at form or object instantiation can increase floating point performance by more than 150% under certain conditions. While still no threat to the Athlon 64 with its monster FPU, this performance boost goes a long way towards making the little C3's floating point throughput more acceptable.
Copyright 2004, Van Smith