Background — High Performance Attached Processors Handicapped By Architecture
The application of high-performance accelerators, notably GPUs, GPGPUs (APUs in AMD terminology) to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for a growing industry ecosystem. Adding heat to an already active mix, Intel’s Xeon Phi accelerators, the most recent addition to the GPU ecosystem, have the potential to speed adoption even further due to hoped-for synergies generated by the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist in the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
Performance — The latency for memory accesses generally dictated by PCIe latencies, which while much improved over previous generations, are a factor of 100 or more longer than latency from coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.
With a couple of months' perspective, I’m pretty convinced that Intel has made a potentially disruptive entry in the market for programmable computational accelerators, often referred to as GPGPUs (General Purpose Graphics Processing Units) in deference to the fact that the market leaders, NVIDIA and AMD, have dominated the segment with parallel computational units derived from high-end GPUs. In late 2012, Intel, referring to the architecture as MIC (Many Independent Cores) introduced the Xeon Phi product, the long-awaited productization of the development project that was known internally (and to the rest of the world as well) as Knight’s Ferry, a MIC coprocessor with up to 62 modified Xeon cores implemented in its latest 22 nm process.
[For some reason this has been unpublished since April — so here it is well after AMD announced its next spin of the SeaMicro product.]
At its recent financial analyst day, AMD indicated that it intended to differentiate itself by creating products that were advantaged in niche markets, with specific mention, among other segments, of servers, and to generally shake up the trench warfare that has had it on the losing side of its lifelong battle with Intel (my interpretation, not AMD management’s words). Today, at least for the server side of the business, it made a move that can potentially offer it visibility and differentiation by acquiring innovative server startup SeaMicro.
SeaMicro has attracted our attention since its appearance (blog post 1, blog post 2) with its innovative architecture that dramatically reduces power and improves density by sharing components like I/O adapters, disks, and even BIOS over a proprietary fabric. The irony here is that SeaMicro came to market with a tight alignment with Intel, who at one point even introduced a special dual-core packaging of its Atom CPU to allow SeaMicro to improve its density and power efficiency. Most recently SeaMicro and Intel announced a new model that featured Xeon CPUs to address the more mainstream segments that were not a part of SeaMicro’s original Atom-based offering.
At its recent financial analyst day, AMD indicated that it intended to differentiate itself by creating products that were advantaged in niche markets, with specific mention, among other segments, of servers, and to generally shake up the trench warfare that has had it on the losing side of its lifelong battle with Intel (my interpretation, not AMD management’s words). Today, at least for the server side of the business AMD made a move that can potentially offer it visibility and differentiation by acquiring innovative server startup SeaMicro.
SeaMicro has attracted our attention since its appearance (blog post 1, blog post 2), with its innovative architecture that dramatically reduces power and improves density by sharing components like I/O adapters, disks, and even BIOS over a proprietary fabric. The irony here is that SeaMicro came to market with a tight alignment with Intel, who at one point even introduced a special dual-core packaging of its Atom CPU to allow SeaMicro to improve its density and power efficiency. Most recently SeaMicro and Intel announced a new model that featured Xeon CPUs to address the more mainstream segments that were not for SeaMicro’s original Atom-based offering.
In late 2010 I noted that startup SeaMicro had introduced an ultra-dense server using Intel Atom chips in an innovative fabric-based architecture that allowed them to factor out much of the power overhead from a large multi-CPU server ( http://blogs.forrester.com/richard_fichera/10-09-21-little_servers_big_applications_intel_developer_forum). Along with many observers, I noted that the original SeaMicro server was well-suited to many light-weight edge processing tasks, but that the system would not support more traditional compute-intensive tasks due to the performance of the Atom core. I was, however, quite taken with the basic architecture, which uses a proprietary high-speed (1.28 Tb/s) 3D mesh interconnect to allow the CPU cores to share network, BIOS and disk resources that are normally replicated on a per-server in conventional designs, with commensurate reductions in power and an increase in density.
While NVIDIA and to a lesser extent AMD (via its ATI branded product line) have effectively monopolized the rapidly growing and hyperbole-generating market for GPGPUs, highly parallel application accelerators, Intel has teased the industry for several years, starting with its 80-core Polaris Research Processor demonstration in 2008. Intel’s strategy was pretty transparent – it had nothing in this space, and needed to serve notice that it was actively pursuing it without showing its hand prematurely. This situation of deliberate ambiguity came to an end last month when Intel finally disclosed more details on its line of Many Independent Core (MIC) accelerators.
Intel’s approach to attached parallel processing is radically different than its competitors and appears to make excellent use of its core IP assets – fabrication and expertise and the x86 instruction set. While competing products from NVIDIA and AMD are based on graphics processing architectures, employing 100s of parallel non-x86 cores, Intel’s products will feature a smaller (32 – 64 in the disclosed products) number of simplified x86 cores on the theory that developers will be able to harvest large portions of code that already runs on 4 – 10 core x86 CPUs and easily port them to these new parallel engines.
Since its introduction of its Core 2 architecture, Intel reversed much of the damage done to it by AMD in the server space, with attendant publicity. AMD, however, has been quietly reclaiming some ground with its 12-core 6100 series CPUs, showing strength in benchmarks that emphasize high throughput in process-rich environments as opposed to maximum performance per core. Several AMD-based system products have also been cited by their manufacturers to us as enjoying very strong customer acceptance due to the throughput of the 12-core CPUs combined with their attractive pricing. As a fillip to this success, AMD this past week announced speed bumps for the 6100-series products to give a slight performance boost as they continue to compete with Intel’s Xeon 5600 and 7500 products (Intel’s Sandy Bridge server products have not yet been announced).
But the real news last week was the quiet subtext that the anticipated 16-core Interlagos products based on the new Bulldozer core appear to be on schedule for Q2 ’11 shipments system partners, who should probably be able to ship systems during Q3, and that AMD is still certifying them as compatible with the current sockets used for the 12-core 6000 CPUs. This implies that system partners will be able to quickly deliver products based on the new parts very rapidly.
Actual performance of these systems will obviously be dependent on the workloads being run, but our gut feeling is that while they will not rival the per-core performance of the Intel Xeon 7500 CPUs, for large throughput-oriented environments with high numbers of processes, a description that fits a large number of web and middleware environments, these CPUs, each with up to a 50% performance advantage per core over the current AMD CPUs, may deliver some impressive benchmarks and keep the competition in the server space at a boil, which in the end is always helpful to customers.
I’ve recently had the opportunity to talk with a small sample of SLES 11 and RH 6 Linux users, all developing their own applications. All were long-time Linux users, and two of them, one in travel services and one in financial services, had applications that can be described as both large and mission-critical.
The overall message is encouraging for Linux advocates, both the calm rational type as well as those who approach it with near-religious fervor. The latest releases from SUSE and Red Hat, both based on the 2.6.32 Linux kernel, show significant improvements in scalability and modest improvements in iso-configuration performance. One user reported that an application that previously had maxed out at 24 cores with SLES 10 was now nearing production certification with 48 cores under SLES 11. Performance scalability was reported as “not linear, but worth doing the upgrade.”
Overall memory scalability under Linux is still a question mark, since the widely available x86 platforms do not exceed 3 TB of memory, but initial reports from a user familiar with HP’s DL 980 verify that the new Linux Kernel can reliably manage at least 2TB of RAM under heavy load.
File system options continue to expand as well. The older Linux FS standard, ETX4, which can scale to “only” 16 TB, has been joined by additional options such as XFS (contributed by SGI), which has been implemented in several installations with file systems in excess of 100 TB, relieving a limitation that may have been more psychological than practical for most users.
I just spent some time talking to ScaleMP, an interesting niche player that provides a server virtualization solution. What is interesting about ScaleMP is that rather than splitting a single physical server into multiple VMs, they are the only successful offering (to the best of my knowledge) that allows I&O groups to scale up a collection of smaller servers to work as a larger SMP.
Others have tried and failed to deliver this kind of solution, but ScaleMP seems to have actually succeeded, with a claimed 200 customers and expectations of somewhere between 250 and 300 next year.
Their vSMP product comes in two flavors, one that allows a cluster of machines to look like a single system for purposes of management and maintenance while still running as independent cluster nodes, and one that glues the member systems together to appear as a single monolithic SMP.
Does it work? I haven’t been able to verify their claims with actual customers, but they have been selling for about five years, claim over 200 accounts, with a couple of dozen publicly referenced. All in all, probably too elaborate a front to maintain if there was really nothing there. The background of the principals and the technical details they were willing to share convinced me that they have a deep understanding of the low-level memory management, prefectching, and caching that would be needed to make a collection of systems function effectively as a single system image. Their smaller scale benchmarks displayed good scalability in the range of 4 – 8 systems, well short of their theoretical limits.
My quick take is that the software works, and bears investigation if you have an application that:
Either is certified to run with ScaleMP (not many), or one where that you control the code.
You understand the memory reference patterns of the application, and