NVIDIA recently shared a case study involving risk calculations at a JP Morgan Chase that I think is significant for the extreme levels of acceleration gained by integrating GPUs with conventional CPUs, and also as an illustration of a mainstream financial application of GPU technology.
JP Morgan Chase’s Equity Derivatives Group began evaluating GPUs as computational accelerators in 2009, and now runs over half of their risk calculations on hybrid systems containing x86 CPUs and NVIDIA Tesla GPUs, and claims a 40x improvement in calculation times combined with a 75% cost savings. The cost savings appear to be derived from a combination of lower capital costs to deliver an equivalent throughput of calculations along with improved energy efficiency per calculation.
Implicit in the speedup of 40x, from multiple hours to several minutes, is the implication that these calculations can become part of a near real-time business-critical analysis process instead of an overnight or daily batch process. Given the intensely competitive nature of derivatives trading, it is highly likely that JPMC will enhance their use of GPUs as traders demand an ever increasing number of these calculations. And of course, their competition has been using the same technology as well, based on numerous conversations I have had with Wall Street infrastructure architects over the past year.
My net take on this is that we will see a succession of similar announcements as GPUs become a fully mainstream acceleration technology as opposed to an experimental fringe. If you are an I&O professional whose users are demanding extreme computational performance on a constrained space, power and capital budget, you owe it to yourself and your company to evaluate the newest accelerator technology. Your competitors are almost certainly doing so.
While NVIDIA and to a lesser extent AMD (via its ATI branded product line) have effectively monopolized the rapidly growing and hyperbole-generating market for GPGPUs, highly parallel application accelerators, Intel has teased the industry for several years, starting with its 80-core Polaris Research Processor demonstration in 2008. Intel’s strategy was pretty transparent – it had nothing in this space, and needed to serve notice that it was actively pursuing it without showing its hand prematurely. This situation of deliberate ambiguity came to an end last month when Intel finally disclosed more details on its line of Many Independent Core (MIC) accelerators.
Intel’s approach to attached parallel processing is radically different than its competitors and appears to make excellent use of its core IP assets – fabrication and expertise and the x86 instruction set. While competing products from NVIDIA and AMD are based on graphics processing architectures, employing 100s of parallel non-x86 cores, Intel’s products will feature a smaller (32 – 64 in the disclosed products) number of simplified x86 cores on the theory that developers will be able to harvest large portions of code that already runs on 4 – 10 core x86 CPUs and easily port them to these new parallel engines.
Intel has been publishing research for about a decade on what they call “3D Trigate” transistors, which held out the hope for both improved performance as well as power efficiency. Today Intel revealed details of its commercialization of this research in its upcoming 22 nm process as well as demonstrating actual systems based on 22 nm CPU parts.
The new products, under the internal name of “Ivy Bridge”, are the process shrink of the recently announced Sandy Bridge architecture in the next “Tock” cycle of the famous Intel “Tick-Tock” design methodology, where the “Tick” is a new optimized architecture and the “Tock” is the shrinking of this architecture onto then next generation semiconductor process.
What makes these Trigate transistors so innovative is the fact that they change the fundamental geometry of the semiconductors from a basically flat “planar” design to one with more vertical structure, earning them the description of “3D”. For users the concepts are simpler to understand – this new transistor design, which will become the standard across all of Intel’s products moving forward, delivers some fundamental benefits to CPUs implemented with them:
Leakage current is reduced to near zero, resulting in very efficient operation for system in an idle state.
Power consumption at equivalent performance is reduced by approximately 50% from Sandy Bridge’s already improved results with its 32 nm process.
A lot has been written about potential threats to Intel’s low-power server hegemony, including discussions of threats from not only its perennial minority rival AMD but also from emerging non-x86 technologies such as ARM servers. While these are real threats, with potential for disrupting Intel’s position in the low power and small form factor server segment if left unanswered, Intel’s management has not been asleep at the wheel. As part of the rollout of the new Sandy Bridge architecture, Intel recently disclosed their platform strategy for what they are defining as “Micro Servers,” small single-socket servers with shared power and cooling to improve density beyond the generally accepted dividing line of one server per RU that separates “standard density” from “high density.” While I think that Intel’s definition is a bit myopic, mostly serving to attach a label to a well established category, it is a useful tool for segmenting low-end servers and talking about the relevant workloads.
Intel’s strategy revolves around introducing successive generations of its Sandy Bridge and future architectures embodied as Low Power (LP) and Ultra Low Power (ULP) products with promises of up to 2.2X performance per watt and 30% less actual power compared to previous generation equivalent x86 servers, as outlined in the following chart from Intel:
So what does this mean for Infrastructure & Operations professionals interested in serving the target loads for micro servers, such as:
The world of hyper scale web properties has been shrouded in secrecy, with major players like Google and Amazon releasing only tantalizing dribbles of information about their infrastructure architecture and facilities, on the presumption that this information represented critical competitive IP. In one bold gesture, Facebook, which has certainly catapulted itself into the ranks of top-tier sites, has reversed that trend by simultaneously disclosing a wealth of information about the design of its new data center in rural Oregon and contributing much of the IP involving racks, servers, and power architecture to an open forum in the hopes of generating an ecosystem of suppliers to provide future equipment to themselves and other growing web companies.
The Data Center
By approaching the design of the data center as an integrated combination of servers for known workloads and the facilities themselves, Facebook has broken some new ground in data center architecture with its facility.
At a high level, a traditional enterprise DC has a utility transformer that feeds power to a centralized UPS, and then power is subsequently distributed through multiple levels of PDUs to the equipment racks. This is a reliable and flexible architecture, and one that has proven its worth in generations of commercial data centers. Unfortunately, in exchange for this flexibility and protection, it extracts a penalty of 6% to 7% of power even before it reaches the IT equipment.
Intel today publicly announced its anticipated “Westmere EX” high end Westmere architecture server CPU as the E7, now part of a new family nomenclature encompassing entry (E3), midrange (E5), and high-end server CPUs (E7), and at first glance it certainly looks like it delivers on the promise of the Westmere architecture with enhancements that will appeal to buyers of high-end x86 systems.
The E7 in a nutshell:
32 nm CPU with up to 10 cores, each with hyper threading, for up to 20 threads per socket.
Intel claims that the system-level performance will be up to 40% higher than the prior generation 8-core Nehalem EX. Notice that the per-core performance improvement is modest (although Intel does offer a SKU with 8 cores and a slightly higher clock rate for those desiring ultimate performance per thread).
Improvements in security with Intel Advanced Encryption Standard New Instruction (AES-NI) and Intel Trusted Execution Technology (Intel TXT).
Major improvements in power management by incorporating the power management capabilities from the Xeon 5600 CPUs, which include more aggressive P states, improved idle power operation, and the ability to separately reduce individual core power setting depending on workload, although to what extent this is supported on systems that do not incorporate Intel’s Node Manager software is not clear.
Calxeda, one of the most visible stealth mode startups in the industry, has finally given us an initial peek at the first iteration of its server plans, and they both meet our inflated expectations from this ARM server startup and validate some of the initial claims of ARM proponents.
While still holding their actual delivery dates and details of specifications close to their vest, Calxeda did reveal the following cards from their hand:
The first reference design, which will be provided to OEM partners as well as delivered directly to selected end users and developers, will be based on an ARM Cortex A9 quad-core SOC design.
The SOC, as Calxeda will demonstrate with one of its reference designs, will enable OEMs to design servers as dense as 120 ARM quad-core nodes (480 cores) in a 2U enclosure, with an average consumption of about 5 watts per node (1.25 watts per core) including DRAM.
While not forthcoming with details about the performance, topology or protocols, the SOC will contain an embedded fabric for the individual quad-core SOC servers to communicate with each other.
Most significantly for prospective users, Calxeda is claiming, and has some convincing models to back up these claims, that they will provide a performance advantage of 5X to 10X the performance/watt and (even higher when price is factored in for a metric of performance/watt/$) of any products they expect to see when they bring the product to market.
Intel, despite a popular tendency to associate a dominant market position with indifference to competitive threats, has not been sitting still waiting for the ARM server phenomenon to engulf them in a wave of ultra-low-power servers. Intel is fiercely competitive, and it would be silly for any new entrants to assume that Intel will ignore a threat to the heart of a high-growth segment.
In 2009, Intel released a microserver specification for compact low-power servers, and along with competitor AMD, it has been aggressive in driving down the power envelope of its mainstream multicore x86 server products. Recent momentum behind ARM-based servers has heated this potential competition up, however, and Intel has taken the fight deeper into the low-power realm with the recent introduction of the N570, a an existing embedded low-power processor, as a server CPU aimed squarely at emerging ultra-low-power and dense servers. The N570, a dual-core Atom processor, is being currently used by a single server partner, ultra-dense server manufacturer SeaMicro (see Little Servers For Big Applications At Intel Developer Forum), and will allow them to deliver their current 512 Atom cores with half the number of CPU components and some power savings.
Technically, the N570 is a dual-core Atom CPU with 64 bit arithmetic, a differentiator against ARM, and the same 32-bit (4 GB) physical memory limitations as current ARM designs, and it should have a power dissipation of between 8 and 10 watts.
Since its introduction of its Core 2 architecture, Intel reversed much of the damage done to it by AMD in the server space, with attendant publicity. AMD, however, has been quietly reclaiming some ground with its 12-core 6100 series CPUs, showing strength in benchmarks that emphasize high throughput in process-rich environments as opposed to maximum performance per core. Several AMD-based system products have also been cited by their manufacturers to us as enjoying very strong customer acceptance due to the throughput of the 12-core CPUs combined with their attractive pricing. As a fillip to this success, AMD this past week announced speed bumps for the 6100-series products to give a slight performance boost as they continue to compete with Intel’s Xeon 5600 and 7500 products (Intel’s Sandy Bridge server products have not yet been announced).
But the real news last week was the quiet subtext that the anticipated 16-core Interlagos products based on the new Bulldozer core appear to be on schedule for Q2 ’11 shipments system partners, who should probably be able to ship systems during Q3, and that AMD is still certifying them as compatible with the current sockets used for the 12-core 6000 CPUs. This implies that system partners will be able to quickly deliver products based on the new parts very rapidly.
Actual performance of these systems will obviously be dependent on the workloads being run, but our gut feeling is that while they will not rival the per-core performance of the Intel Xeon 7500 CPUs, for large throughput-oriented environments with high numbers of processes, a description that fits a large number of web and middleware environments, these CPUs, each with up to a 50% performance advantage per core over the current AMD CPUs, may deliver some impressive benchmarks and keep the competition in the server space at a boil, which in the end is always helpful to customers.
Last week IBM and ARM Holdings Plc quietly announced a continuation of their collaboration on advanced process technology, this time with a stated goal of developing ARM IP optimized for IBM physical processes down to a future 14 nm size. The two companies have been collaborating on semiconductors and SOC design since 2007, and this extension has several important ramifications for both companies and their competitors.
It is a clear indication that IBM retains a major interest in low-power and mobile computing, despite its previous divestment of its desktop and laptop computers to Lenovo, and that it will be in a position to harvest this technology, particularly ARM's modular approach to composing SOC systems, for future productization.
For ARM, the implications are clear. Its latest announced product, the Cortex A15, which will probably appear in system-level products in approximately 2013, will be initially produced in 32 nm with a roadmap to 20nm. The existence of a roadmap to a potential 14 nm product serves notice that the new ARM architecture will have a process roadmap that will keep it on Intel’s heels for another decade. ARM has parallel alliances with TSMC and Samsung as well, and there is no reason to think that these will not be extended, but the IBM alliance is an additional insurance policy. As well as a source of semiconductor technology, IBM has a deep well of systems and CPU IP that certainly cannot hurt ARM.