Intel has made no secret of its development of the Xeon D, an SOC product designed to take Xeon processing close to power levels and product niches currently occupied by its lower-power and lower performance Atom line, and where emerging competition from ARM is more viable.
The new Xeon D-1500 is clear evidence that Intel “gets it” as far as platforms for hyperscale computing and other throughput per Watt and density-sensitive workloads, both in the enterprise and in the cloud are concerned. The D1500 breaks new ground in several areas:
It is the first Xeon SOC, combining 4 or 8 Xeon cores with embedded I/O including SATA, PCIe and multiple 10 nd 1 Gb Ethernet ports.
It is the first of Intel’s 14 nm server chips expected to be introduced this year. This expected process shrink will also deliver a further performance and performance per Watt across the entire line of entry through mid-range server parts this year.
Why is this significant?
With the D-1500, Intel effectively draws a very deep line in the sand for emerging ARM technology as well as for AMD. The D1500, with 20W – 45W power, delivers the lower end of Xeon performance at power and density levels previously associated with Atom, and close enough to what is expected from the newer generation of higher performance ARM chips to once again call into question the viability of ARM on a pure performance and efficiency basis. While ARM implementations with embedded accelerators such as DSPs may still be attractive in selected workloads, the availability of a mainstream x86 option at these power levels may blunt the pace of ARM design wins both for general-purpose servers as well as embedded designs, notably for storage systems.
One of the developing trends in computing, relevant to both enterprise and service providers alike, is the notion of workload-specific or application-centric computing architectures. These architectures, optimized for specific workloads, promise improved efficiencies for running their targeted workloads, and by extension the services that they support. Earlier this year we covered the basics of this concept in “Optimize Scalable Workload-Specific Infrastructure for Customer Experiences”, and this week HP has announced a pair of server cartridges for their Moonshot system that exemplify this concept, as well as being representative of the next wave of ARM products that will emerge during the remainder of 2014 and into 2015 to tilt once more at the x86 windmill that currently dominates the computing landscape.
Specifically, HP has announced the ProLiant m400 Server Cartridge (m400) and the ProLiant m800 Server Cartridge (m800), both ARM-based servers packaged as cartridges for the HP Moonshot system, which can hold up to 45 of these cartridges in its approximately 4U enclosure. These servers are interesting from two perspectives – that they are both ARM-based products, one being the first tier-1 vendor offering of a 64-bit ARM CPU and that they are both being introduced with a specific workload target in mind for which they have been specifically optimized.
Background — High Performance Attached Processors Handicapped By Architecture
The application of high-performance accelerators, notably GPUs, GPGPUs (APUs in AMD terminology) to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for a growing industry ecosystem. Adding heat to an already active mix, Intel’s Xeon Phi accelerators, the most recent addition to the GPU ecosystem, have the potential to speed adoption even further due to hoped-for synergies generated by the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist in the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
Performance — The latency for memory accesses generally dictated by PCIe latencies, which while much improved over previous generations, are a factor of 100 or more longer than latency from coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.