Date: July 29, 2002
One of the most wonderful performance-enhancing
features of the forthcoming Opteron (Hammer) is the set of multiple independent direct
memory channels built into each CPU. This is a huge, huge
difference from the shared memory (actually, shared everything) approach of
Intel symmetric multi-processing (SMP) systems. Given the limitations of
today's memory technology, this was an exceedingly smart move on the part of AMD
and will likely change the way we design some of our applications in the very
near future.
Intel SMP systems exhibit terrible scalability beyond as few as three CPUs if programs and data don't fit into a local CPU's cache (unless the programs are designed from the start for parallelism). This is largely why Intel CPUs are being equipped with larger and larger caches. Even big cache increases don't buy much scalability on shared-memory SMP systems, since maintaining cache coherency among multiple CPUs remains a serious limitation. The bus snooping and resulting cache flush/fill activity can cripple a given CPU for many cycles of operation.

In addition, modern OSes use globally visible data structures to coordinate activity among independent CPUs. The atomic test-and-set operations needed to support spinlocks and semaphores can chew up lots of memory bandwidth and suffer potentially severe latency whenever dirty cache lines need flushing (forcing the other CPUs' caches to resynchronize their images of shared memory). That latency just kills these systems for many, many applications.

That is why scalability is so poor, as can be seen in various tests running everyday apps on everyday OSes such as Unix and Windows NT/2K; when the fourth CPU delivers only a 10% performance kick, the term "scalable" is a complete misnomer. Generally, the apps themselves are simply not designed to run efficiently multi-threaded across multiple CPUs (they are architecturally limited). And it's not just the applications that have problems with SMP: the OS kernels are often a nightmare, with critical code segments heavily dependent on a given hardware design (memory controller, cache subsystem, etc.), and with full SMP there is LOTS of activity competing for the available latency and bandwidth of the memory system.
Of course, a multi-threaded OS running on a multi-CPU system (actually, it's multi-process too) still needs some form of shared-resource control, and in the multi-threaded case some way of doing those spinlocks and semaphores. Most of our current software is capable of multi-threading over multiple CPUs but is essentially designed for the shared memory model. How can we get past the performance bottleneck? Thread affinity and other techniques have been developed to help deal with the issue, but the shared memory model itself is just plain the primary problem. There is only so much bandwidth to share, and it gets used up fast. It's clear that memory bandwidth is generally THE limiting factor on SMP systems. Thus, to get away from the problem we will need to get away from SMP. It's that simple.
You could see this problem coming years ago as CPU speeds zoomed past memory performance, which has plodded along at a sub-Moore development rate. There simply isn't enough memory performance in SMP to keep up with the CPUs. For high-performance, high-volume applications (transaction processing, rendering, searching, etc.) there MUST be division of the workload among multiple separate memory systems. There are several ways to do this, and currently they tend to be rather expensive. One way to get a lot of performance out of a multi-CPU configuration is to put the CPUs into separate systems.
When data centers are designed to handle massive loads, we often use separate systems that operate together in "clusters". Perhaps the company best at doing this was the now-absorbed Digital Equipment Corporation (good old DEC). Their VMS operating system had a built-in clustering capability that allowed multiple separate systems to share their resources in nearly complete harmony (with a significant amount of overhead for the inter-system management to take place). Each system of course had its own OS image and attached resources (which could be shared as desired), along with various enhancements to the OS that allowed such things as distributed lock management, failover of applications, etc. This let any given system "see" the resources of another and, by passing an internally generated request to another system (or virtualizing it), gain access to those resources. The point here is that each system ran its own apps while utilizing the storage and I/O systems of the others. It is pretty amazing when done properly (and you do have to have some skill at application development and cluster management to make it work well). We can do this with other operating systems, typically Linux (IBM mainframes can too), and Microsoft is trying very hard to do the same (it ain't easy, and their OSes are already such a kluge it may take them years yet to get it right!). The main benefit of clusters is that we don't have shared memory issues; the main drawback is that we still have distributed I/O performance issues, usually due to the bandwidth limits of the system interconnect.
Of course, clusters tend to be somewhat expensive and power hungry, since we need to build completely separate systems and then link them with gigabit Ethernet, Fibre Channel, or some other high-performance LAN that acts as a "backplane" for the systems to communicate over. (DEC used a thingy called CI, or "cluster interconnect", that was very fast.) Clusters work great, but the cost of each box and the expense of hooking them together with state-of-the-art networking gear is quite high. There is also the power and cooling expense to consider. If we could dream of a "perfect" system, what would we really need?
What we really want are the best features of a cluster without all the expense in hardware and energy cost. We also don't want just any old SMP system with its bandwidth and latency limitations. What will really help eliminate the SMP bottleneck issues is to have each CPU do its own thing (just like any given system in a cluster) and communicate with another CPU only when it must (again, just like in a cluster). With AMD's high-end Opteron CPUs a lot of the cost can be eliminated, and the whole idea of running applications on such a platform may change the way we "do" computing forever.
If you've looked at some functional block diagrams of AMD's intended 4-way high-end Opteron systems (see this .pdf file for reference), you will note that all four CPUs are interconnected via HyperTransport (HT) links. One of the CPUs is intended to be dedicated to serving as the primary display and I/O resource manager while the other three are dedicated to process or thread execution. One potential problem is that in most current OS designs, process/thread scheduling is controlled by a single CPU, and this can create more inefficiencies by requiring significant amounts of inter-process, and thus inter-processor, communication (IPC). There simply isn't any free lunch. But what if we didn't need lunch so often? If we just turned each CPU loose on its own set of problems and let it manage them by itself, it would only have to "talk" to another CPU when it needed I/O or access to some mutually agreed shared resource. Hmmm, this is starting to sound a bit like a cluster....
If we extend the concept a bit, why not go ahead and provide a full mini-kernel to each CPU and let that CPU schedule and manage its own list of processes? That can cut down significantly on IPC requirements and, for certain applications (that are written properly), eliminate the cache coherency and shared-memory latency/bandwidth issues. There's still the shared I/O to deal with, but at least that is handed off to a completely separate CPU. In the Opteron architecture, the HT links allow extremely efficient I/O transfers to the buffer memory of a CPU that wants to move data in or out. As Nils has pointed out, this is almost a mainframe-class "channel" architecture (and it is very, very efficient). Each HT channel can handle over a giga-BYTE per second of data transfer, and that's enough to gobble up the output of over a dozen striped disk drives simultaneously (if that isn't enough, just wait for the next rev of HT!). Each CPU's cache does its own thing with its own list of processes, cutting inter-CPU IPC dramatically.
Applications would generally run completely within a given CPU and should be designed for CPU thread affinity. This will keep the garden-variety application design similar to how we do things today and keep things simple, which isn't much of a problem since these CPUs are smokin' anyway. However, when there is a compelling need for wringing out performance (such as object and relational DBMSes, rendering, etc.), we can easily design threads that run more or less independently on separate CPUs and then coordinate their intermediate results over those HT channels (Linux Beowulf clusters often do this already, so the techniques are well known).
These are not new ideas, but the imminent arrival of AMD's Opteron server-class CPUs begs for a more advanced OS architecture than we are currently using in shared-memory SMP systems. We will not be able to get the best from these systems without a superior OS design. Is there one available?
If you've followed so far, you'll be interested in this heavily footnoted paper by Karim Yaghmour of Opersys. He suggests and analyzes what it would take to get a separate Linux kernel running on each CPU of a multi-CPU system. There is no mention of Hammer or Opteron, but the concepts can easily be applied, and without doubt more efficiently on the Opteron architecture than on Intel SMP.
Of course, the above doesn't even begin to outline the plethora of issues involved, but it at least highlights where some of the biggest problems currently are and how the AMD Opteron architecture can potentially provide a significant advance. Think on it, y'all!
Brief on paper
Opersys home page