Background — High Performance Attached Processors Handicapped By Architecture
The application of high-performance accelerators, notably GPUs and GPGPUs (APUs in AMD terminology), to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for an expanding industry ecosystem. Adding heat to an already active mix, Intel’s Xeon Phi accelerators, the most recent addition to the GPU ecosystem, have the potential to speed adoption even further, thanks to hoped-for synergies with the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist in the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
Performance: The latency for memory accesses is generally dictated by PCIe latencies, which, while much improved over previous generations, are a factor of 100 or more longer than the latency of coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers, can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.
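To make the point about the initial block concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function names are my own, the time units are arbitrary, and the numbers are illustrative assumptions rather than measurements of any real PCIe link or GPU. It compares moving and processing blocks strictly one after another against a double-buffered scheme in which the next block is transferred while the current one is being computed.

```python
def serial_time(n_blocks, transfer, compute):
    """Total time when each block is transferred, then computed, with no overlap."""
    return n_blocks * (transfer + compute)


def pipelined_time(n_blocks, transfer, compute):
    """Total time when block i+1 is transferred while block i is computed.

    Assumes an idealized two-stage pipeline (hypothetical model): the first
    transfer and the last compute have nothing to overlap with, and every
    step in between costs whichever stage is slower.
    """
    if n_blocks == 0:
        return 0
    steady_state = (n_blocks - 1) * max(transfer, compute)
    return transfer + steady_state + compute


if __name__ == "__main__":
    # Illustrative values only: transfer = 2 units, compute = 3 units per block.
    for n in (1, 3, 100):
        print(n, serial_time(n, 2, 3), pipelined_time(n, 2, 3))
```

With a single block the two functions return the same value, which is the initial-block problem in miniature: there is nothing to overlap the first transfer with. The benefit of overlapping only shows up as the number of blocks grows.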
NVIDIA recently shared a case study involving risk calculations at JP Morgan Chase that I think is significant both for the extreme levels of acceleration gained by integrating GPUs with conventional CPUs and as an illustration of a mainstream financial application of GPU technology.
JP Morgan Chase’s Equity Derivatives Group began evaluating GPUs as computational accelerators in 2009, and now runs over half of their risk calculations on hybrid systems containing x86 CPUs and NVIDIA Tesla GPUs, and claims a 40x improvement in calculation times combined with a 75% cost savings. The cost savings appear to be derived from a combination of lower capital costs to deliver an equivalent throughput of calculations along with improved energy efficiency per calculation.
Implicit in the 40x speedup, from multiple hours to several minutes, is that these calculations can become part of a near real-time, business-critical analysis process instead of an overnight or daily batch process. Given the intensely competitive nature of derivatives trading, it is highly likely that JPMC will expand its use of GPUs as traders demand an ever-increasing number of these calculations. And of course, based on numerous conversations I have had with Wall Street infrastructure architects over the past year, its competition has been using the same technology as well.
My net take on this is that we will see a succession of similar announcements as GPUs become a fully mainstream acceleration technology as opposed to an experimental fringe. If you are an I&O professional whose users are demanding extreme computational performance on a constrained space, power and capital budget, you owe it to yourself and your company to evaluate the newest accelerator technology. Your competitors are almost certainly doing so.
Since introducing its Core 2 architecture, Intel has reversed much of the damage done to it by AMD in the server space, with attendant publicity. AMD, however, has been quietly reclaiming some ground with its 12-core 6100 series CPUs, showing strength in benchmarks that emphasize high throughput in process-rich environments as opposed to maximum performance per core. Several manufacturers have also told us that their AMD-based system products are enjoying very strong customer acceptance due to the throughput of the 12-core CPUs combined with their attractive pricing. As a fillip to this success, AMD this past week announced speed bumps for the 6100-series products to give a slight performance boost as they continue to compete with Intel’s Xeon 5600 and 7500 products (Intel’s Sandy Bridge server products have not yet been announced).
But the real news last week was the quiet subtext that the anticipated 16-core Interlagos products based on the new Bulldozer core appear to be on schedule for Q2 ’11 shipments to system partners, who should probably be able to ship systems during Q3, and that AMD is still certifying them as compatible with the current sockets used for the 12-core 6000 CPUs. This implies that system partners will be able to deliver products based on the new parts very rapidly.
Actual performance of these systems will obviously depend on the workloads being run, and our gut feeling is that they will not rival the per-core performance of the Intel Xeon 7500 CPUs. But for large throughput-oriented environments with high numbers of processes, a description that fits a large number of web and middleware environments, these CPUs, each with up to a 50% performance advantage per core over the current AMD CPUs, may deliver some impressive benchmarks. That would keep the competition in the server space at a boil, which in the end is always helpful to customers.