COSBI OpenSourceMark Preview

Written By Van Smith

Date: April 22, 2004

We have been promoting the notion of Open Source benchmarks for several years, all the while halfway hoping that someone else would take the initiative on this obviously worthwhile task.  Well, we're tired of waiting.  You can download a preview of our efforts here.

We documented our allegations against commercial benchmarks and have since been proven correct.  But even discounting the scandals that continue behind the scenes in the closed source development of commercial benchmarks, there are obvious shortcomings that closed source benchmarks will always have.  The biggest flaw can be summarized in one word: "verifiability."

Without knowing what is specifically going on inside a benchmark, the results it produces are useless, yet so-called analysts will quote numbers from commercial benchmarks without having an inkling of insight into what the numbers actually mean.  Worse yet, many commercial benchmarks churn through hours of tests and spit out one or two "dumb" numbers that are supposed to represent vague "business application" or "content creation" performance.

Open Source benchmarks are innately verifiable.  Anyone will be able to freely dig into the code to find out what's really going on.  Users can even tweak Open Source benchmarks to their specific needs.  Open Source benchmarks empower the end user while closed source commercial benchmarks are often little more than marketing tools.



COSBI OpenSourceMark, or "OSMark" for short, is an object-oriented testing harness written in Borland Delphi 7.  The first versions of OSMark contain a fixed set of tests that are meant to demonstrate to programmers, analysts and end users how to produce their own benchmarks.  We designed the class framework to facilitate the quick and easy development of new benchmarks, and this is at least as important as the binaries that we will distribute for benchmarking.

Of course, the tests are also intended to provide valuable performance insight in a benchmarking environment.

COSBI tests are written in several languages, but we selected Delphi for our main development environment because it is based upon Object Oriented Pascal, which is much easier to read and understand than C or C++ code.  Also, Delphi's Rapid Application Development (RAD) tools are superlative, greatly simplifying the production of complex GUI-based applications and tests.

Delphi also has a Linux clone called "Kylix" that will help us port COSBI to Linux.  Furthermore, Borland offers Kylix in a free version for Open Source development, which makes COSBI development accessible to everyone.  Although Borland has regrettably stopped giving away Delphi (the free Japanese port is still on the web), a time-limited trial edition can be downloaded for experimentation.  There are also very substantial educational discounts available for Delphi.  Although the bulk of COSBI is written in Delphi, some of the benchmarks are written in C and C++, so Delphi/Kylix is not necessary for producing COSBI OSMark tests.

Note: The fate of future Kylix versions appears to be grim.  If the situation does not improve, we will likely look into moving COSBI to another programming environment like Qt.

We will discuss COSBI source code in detail when we complete and release the documentation.  All source code will also be made available at that time.  If you are a webmaster, analyst or are simply interested in contributing code to OSMark, let us know.  We will give access to the code to a limited number of people prior to the official release in order to reduce support headaches.

The tests in the OSMark binary that we distribute today are meant to target specific and orthogonal performance characteristics.  The tests range from purely synthetic -- meaning that the tests are not "real world" but are attempts to determine performance under certain tasks -- to application level benchmarks.  We will briefly discuss each test below.

In our recent Transmeta Efficeon review, we exclusively used COSBI OSMark to evaluate performance.  We also used COSBI tools to discover that the Efficeon will reduce its clock speed under very light load to avoid overheating.  We are confident that the results that OSMark provided in our review will reflect the performance of the Efficeon better than any other single benchmark.  In fact, we strongly believe that OSMark is superior to FutureMark's PCMark, with the most significant advantage being the availability of OSMark's source code.

COSBI, which stands for the Comprehensive Open Source Benchmarking Initiative, does not begin and end with OSMark.  We have also developed a business application level benchmark that is in alpha stage, and we have produced a number of targeted benchmarking tools for examining performance areas like memory bandwidth, hard drive performance, etc. in greater depth.  Finally, we have developed a number of utilities that help users log CPU usage, track CPU clock speed, and test for CPU throttling.


Installing OSMark

One of our design objectives with OSMark is to completely avoid installation packages.  This allows OSMark to be used to test computers in situ at computer stores before purchase without polluting the target system's registry or system files.

To install OSMark, download the zipped file and extract it to a location of your choice.  Be sure to preserve the directory structure (this is usually the default).  If you need a zip program, an excellent option is the Italian freeware program ZipGenius.


Running OSMark

To run OSMark, double-click the executable "CosbiOpenSourceMark.exe" inside the directory "OSMark."  The program takes a few seconds to launch because it probes your computer for system information.  The interface is described in the illustration below.

The most commonly used option will be the "Official Run" which selects all tests, runs three iterations, picks the best of three runs for each test, generates an OSMark score and opens the powerful result viewer shown below.
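The "best of three" selection can be sketched in a few lines.  This is a minimal Python illustration, not OSMark's Delphi code; the helper name best_of and the assumption that "best" means the shortest elapsed time are ours.

```python
import time

def best_of(workload, iterations=3):
    """Run `workload` several times and return the shortest elapsed time in seconds.

    Assumption: "best" means the fastest run.  `best_of` is a hypothetical
    helper, not a name taken from the OSMark source.
    """
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Example: time a small CPU-bound loop three times and keep the best run.
best = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of three: {best:.6f} s")
```

Keeping the fastest run rather than the average tends to filter out interference from background activity on the test machine.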



Result Viewer

The OSMark Result Viewer is extremely powerful.  Not only does it graphically illustrate the performance of your system relative to other systems, but comparison systems can be filtered, the graphs support zooming, individual tests can be examined and charted, the results can be sorted by many different criteria, data can be exported into different formats, the graph can be saved to a file or copied to the clipboard, the graph background can be changed to make it more suitable for publication, the entire database of results can be saved and traded and more.

Here is a screenshot of the OSMark Result Viewer.

As you can see, results are presented in the form of stacked bars.  Clicking on any bar segment will result in the graph exploding into a view of that specific test.

Clicking on any bar will then restore the graph to the original view.

The table at the bottom of the results window contains system information and score breakdown.  Clicking on a column header will sort both the results and graph by that column.  If the column contains test results, the results will be sorted and the graph will toggle between exploded view and composite view.

It is very easy to add a background image to a graph to make it suitable for publication.  The chart below, with the VHJ watermark, was produced entirely within OSMark and was copied to the clipboard and pasted into this article.  This process only took a few seconds.

Results can also be easily exported to a variety of formats.  The export screen is shown below.

Results can also be filtered to provide comparisons with systems of your choice.  Here is the filter dialog.

The filter action above added the 1GHz Transmeta Efficeon to our result view.

Note: OSMark currently defaults to showing all of the results from the result database.  We will implement the Default Filter prior to the official release.  In the meantime, showing all of the data can make for a busy chart.  You will be able to quickly locate your results by searching for a platform name that begins with "My."  Alternatively, you can sort the results by date.

Zooming is accomplished by clicking and dragging from the upper-left corner of the region you are interested in viewing to the lower-right corner.  The graph below shows the result of a zoom. 

To un-zoom, you can either click and drag in the opposite direction (from lower-right to upper-left), or you can click the "UnZoom" button.

The data in the grid can be edited and rows can be added or deleted.  Changes will not be permanent until the "Save" button is pressed.


Test Scores

All test scores are normalized against a Dell OptiPlex SX270 with a 3GHz, 800MHz FSB Pentium 4.  HyperThreading is disabled.  The system uses two channels of DDR400 and the graphics adapter is the integrated Intel 82865G.  The hard drive is a 160GB, 7200rpm, ATA133 Maxtor DiamondMax Plus 9.  The reference system scores 1,000 on all tests.

If a system is twice as fast as the reference system on a specific test, it will score a 2,000.  If it is half as fast, it will score a 500.
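The normalization rule above can be written as a one-line formula.  The sketch below is an illustration under one stated assumption: that each test produces an elapsed time, so "twice as fast" means half the reference time.  The function name normalized_score is hypothetical.

```python
REFERENCE_SCORE = 1000  # the reference system scores 1,000 on every test

def normalized_score(reference_seconds, system_seconds):
    """Scale a result so that the reference system scores 1,000.

    Assumption: each test yields an elapsed time, so a system that is
    twice as fast finishes in half the reference time.
    """
    if system_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return REFERENCE_SCORE * reference_seconds / system_seconds

print(normalized_score(10.0, 5.0))   # twice as fast as the reference -> 2000.0
print(normalized_score(10.0, 20.0))  # half as fast as the reference  -> 500.0
```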

The OSMark overall score is the average of all of the normalized individual test scores.  We chose arithmetic mean over geometric mean to generate the OSMark score because we believe that transparency is one of the most important goals to pursue in benchmark development.  Most people have an intuitive understanding of the arithmetic mean, which is just the simple average, while very few people even know how to calculate the geometric mean, much less understand it.

Furthermore, the arithmetic mean is naturally reflected in stacked bar charts which we feel are ideal for conveying the overall score.

Although the decision to go with the arithmetic mean would seem to be obvious, the VIA C3 complicated the issue.  VIA's tiny chip so completely dominates all other CPUs on a single but very important test that its OSMark score is one of the highest of all CPUs, despite the fact that the C3 is not very fast on many other tests.  For this reason, we strongly suggest that all quoted OSMark scores be accompanied by a stacked bar chart illustrating the result breakdown as shown below.

From the graph above, it is clear that the C3 is not nearly as balanced as the other CPUs in the chart, but at the same time it is obvious that the C3 has truly prodigious AES Encrypt/Decrypt performance.
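The effect of one dominant test on the two kinds of mean can be illustrated numerically.  The scores below are invented for illustration only and are not measured results for the C3 or any other CPU; the function names are our own.

```python
import math

def arithmetic_mean(scores):
    """Simple average: the sum divided by the count."""
    return sum(scores) / len(scores)

def geometric_mean(scores):
    """nth root of the product, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Hypothetical normalized scores, invented for illustration only.
balanced = [1000, 1000, 1000, 1000]
lopsided = [500, 500, 500, 8000]   # slow on three tests, very fast on one

for name, scores in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, round(arithmetic_mean(scores)), round(geometric_mean(scores)))
```

With these made-up numbers the lopsided system averages 2,375 arithmetically but only about 1,000 geometrically, which is why a breakdown chart is worth publishing alongside any single arithmetic score.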


Test Descriptions

Here is a very brief rundown of the 28 tests provided in version 0.99d of COSBI OSMark.  We use a VIA C3 system to demonstrate the behavior of a few of the tests because the C3 allows you to change the multiplier inside of the OS making CPU clock speed scaling experiments very easy to perform.

* GridBlast, GridBlastFP: One of the most time-consuming spreadsheet tasks is iterative calculation over large datasets.  These two tests iterate across a grid full of numbers over a fixed range of calculations.  The first test uses only integers, while the second test uses only floating point data.  Both tests are influenced mostly by CPU and 2D performance, and both scale perfectly with CPU clock speed as can be seen in the graph below.

* Fibonacci( 40 ) is a recursive integer routine that calculates the fortieth Fibonacci number.  This very simple routine is entirely CPU bound.

* n-Body is a simulation of four bodies interacting through an inverse square law and confined to a plane.  It is essentially a planetary orbit simulation.  The algorithm used is Euler's Last Point Approximation and is extremely FPU intensive.  The benchmark is almost completely bound by floating point performance, but 2d and memory bandwidth performance also influence scores.

* PlotTrig, PlotTrig2 test transcendental function performance, namely sine, cosine and tangent.  The first test plots every result and is therefore sensitive to 2d performance.  The second test only plots every hundredth point so is completely bound by FPU performance.

* Random Dots plots dots randomly over the surface of the window.  Although at first this test might intuitively appear to be entirely 2d bound, CPU performance has a very large impact on results.

* Circles plots filled circles all over the screen.  Again, this might appear to be 2d-bound, but CPU performance strongly influences throughput on this test.

* Mazes generates and then graphically solves complex mazes.  CPU and 2D performance both play a role in scores on this test.

* Fern Fractal is an FPU intensive fractal generation program.  Although it is limited primarily by the FPU, 2D performance also influences the score.

* The Rich Edit test is a simple word processor simulation.  While the result is mainly a function of the 2D graphics core, CPU performance usually plays a secondary role.

* Using an algorithm from famed Delphi author Ray Lischner, the Calculate Pi test does just that: it calculates pi.  This test is completely bound by the CPU.

* Dhrystone and Whetstone are two of the most famous benchmarks ever written.  Dhrystone is an integer test completely bound by the CPU, while Whetstone is FPU bound.

* BandwidthBP64 measures main memory bandwidth over a 16MB block.  The test block-prefetches 64 sequential cache lines at a time to optimize throughput.  This test is almost entirely bound by main memory bandwidth and FSB.

* In the real world, the most important aspect of memory latency is how much performance suffers under heavy randomized memory access.  The MemLatency test attempts to produce a worst-case scenario by copying random elements from a 16MB 32-bit integer array to random targets in another 16MB 32-bit integer array.  This test is extremely sensitive to memory timings and is good for tweaking BIOS settings for optimal performance.  Poor performance on the MemLatency test usually bodes ill for business application performance.

* Maze Threads, Orthogonal Threads and Identical Threads test three distinctly different threading conditions.  Orthogonal Threads is the friendliest test for SMP systems.  With the two threads sharing neither code nor data, O. Threads should always scale almost perfectly with the addition of a second processor.  I. Threads should also scale well, but, since the quick sort threads are identical, poorly thought out SMP systems can thrash shared cache lines and degrade performance.  Maze Threads spawns two Maze tests and is an instance of a worst-case multithreading scenario.  We have not yet seen any SMP or HyperThreading system scale positively when running this test.

* JPG Decode simply times how long it takes for your computer to decode a couple of JPG photographs.  This test is CPU, FPU and bandwidth sensitive.

* Image Resize will zoom and unzoom a JPG photograph.  On some systems, this test appears to be gated by the 2d core.

* Image Rotate is self descriptive and, on many systems, is largely a function of the 2d core (but there are exceptions).

* MP3 Encode uses Japan's popular GoGo encoder to test MP3 encoding performance.  It is FPU intensive.

* Web Page Load instantiates Internet Explorer and loads a set of web pages.  The web pages were taken from this site to avoid copyright issues.  Of course, JPG decoding ability has an impact on performance on this test.

* Zip Compress uses the free version of the ZipForge component and tests how well your system can zip up files.

* Security is becoming more and more vital for everyone and the Advanced Encryption Standard (AES) is probably the single most important security algorithm performed on mainstream computers today.  AES was recently adopted by American and European governments, and has been approved by the NSA for encrypting Secret and even Top Secret material.  AES is becoming important for encrypted wireless networking, real-time encrypted file systems and all forms of secure Internet communications.  CPU vendors AMD, Intel, Transmeta and VIA have all publicly demonstrated AES performance for their respective processors.  Consequently, AES was really the only serious option for our Encrypt/Decrypt test, which we based upon Dr. Brian Gladman's code.  The VIA C3's fantastic showing on AES is one of the most remarkable advancements in computational functionality that we have ever seen.

* The File Copy test creates a 100MB file and copies it to five targets using standard Windows API methods.  This is a simple real life task and its merit should be clear.
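To give a feel for how small some of the tests in the list above are, here is a rough Python sketch of two of them: the recursive Fibonacci routine and the random copy at the heart of MemLatency.  It is not the OSMark source, which is written in Delphi.  The names fib, random_copy and timed are our own, and the sizes are reduced so that the demo finishes quickly.  In Python, interpreter overhead dominates, so the second function shows the access pattern rather than a meaningful latency figure.

```python
import random
import time

def fib(n):
    """Naive recursive Fibonacci, with fib(0) == 0 and fib(1) == 1."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def random_copy(source, target, copies, rng):
    """Copy randomly chosen elements of `source` to random slots in `target`."""
    for _ in range(copies):
        target[rng.randrange(len(target))] = source[rng.randrange(len(source))]

def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Small sizes keep the demo quick; OSMark uses Fibonacci(40) and 16MB arrays.
value, seconds = timed(fib, 20)
print(f"fib(20) = {value} in {seconds:.4f} s")

rng = random.Random(0)            # fixed seed so runs are repeatable
source = list(range(100_000))     # plain Python lists stand in for integer arrays
target = [0] * 100_000
_, seconds = timed(random_copy, source, target, 100_000, rng)
print(f"random copy took {seconds:.4f} s")
```

Because the recursion recomputes the same values over and over, the run time of fib grows rapidly with n, which is what makes such a short routine usable as a CPU-bound test.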


Wrap Up

There are many more OSMark features that we have not yet discussed, but we will be more thorough for the official release.  We hope that you find COSBI OSMark to be useful.

We know that there are a number of bugs that still exist in OSMark.  If you encounter any of the six-legged critters, please let us know.  If you can think of improvements for OSMark, please send them in.

If you would like access to the pre-release source code, let us know.  We are especially interested in accumulating more tests.  Specifically, we need tests that stress SSE, SSE2, SSE3, DirectX and OpenGL.  DVD playback and video encoding are also on our list.

Is OSMark perfect?  No, but at least it is an earnest start towards an open, verifiable benchmark.