Background — High Performance Attached Processors, Handicapped By Architecture
The application of high-performance accelerators, notably GPUs, GPGPUs (APUs in AMD terminology) to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for a growing industry ecosystem, most recently joined by Intel’s Xeon Phi accelerators which have to potential to speed adoption even further due to hoped-for synergies generated by the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist on the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
Performance — The latency for memory accesses generally dictated by PCIe latencies, which while much improved over previous generations, are a factor of 100 or more longer than latency from coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.
The industry is abuzz with speculation that IBM will sell its x86 server business to Lenovo. As usual, neither party is talking publicly, but at this point I’d give it a better than even chance, since usually these kind of rumors tend to be based on leaks of real discussions as opposed to being completely delusional fantasies. Usually.
So the obvious question then becomes “Huh?”, or, slightly more eloquently stated, “Why would they do something like that?”. Aside from the possibility that this might all be fantasy, two explanations come to mind:
1. IBM is crazy.
2. IBM is not crazy.
Of the two explanations, I’ll have to lean toward the latter, although we might be dealing with a bit of the “Hey, I’m the new CEO and I’m going to do something really dramatic today” syndrome. IBM sold its PC business to Lenovo to the tune of popular disbelief and dire predictions, and it's doing very well today because it transferred its investments and focus to higher margin business, like servers and services. Lenovo makes low-end servers today that it bootstrapped with IBM licensed technology, and IBM is finding it very hard to compete with Lenovo and other low-cost providers. Maybe the margins on its commodity server business have sunk below some critical internal benchmark for return on investment, and it believes that it can get a better return on its money elsewhere.
In his 1956 dystopian sci-fi novel “The City and the Stars”, Arthur C. Clarke puts forth the fundamental design tenet for making eternal machines, “A machine shall have no moving parts”. To someone from the 1950s current computers would appear to come close to that ideal – the CPUs and memory perform silent magic and can, with some ingenuity, be passively cooled, and invisible electronic signals carry information in and out of them to networks and … oops, to rotating disks, still with us after more than five decades[i]. But, as we all know, salvation has appeared on the horizon in the form of solid-state storage, so called flash storage (actually an idea of several decades standing as well, just not affordable until recently).
The initial substitution of flash for conventional storage yields immediate gratification in the form of lower power, maybe lower cost if used effectively, and higher performance, but the ripple effect benefits of flash can be even more pervasive. However, the implementation of the major architectural changes engendered across the whole IT stack by the use of flash is a difficult conceptual challenge for users and largely addressed only piecemeal by most vendors. Enter IBM and its Flashahead initiative.
What is Happening?
On Friday, April 11, IBM announced a major initiative, to the tune of a spending commitment of $1B, to accelerate the use of flash technology by means of three major programs:
· Fundamental flash R&D
· New storage products built on flash-only memory technology
HP today announced the Moonshot 1500 server, their first official volume product in the Project Moonshot server product family (the initial Redstone, a Calxeda ARM-based server, was only available in limited quantities as a development system), and it represents both a significant product today and a major stake in the ground for future products, both from HP and eventually from competitors. It’s initial attractions – an extreme density low power x86 server platform for a variety of low-to-midrange CPU workloads – hides the fact that it is probably a blueprint for both a family of future products from HP as well as similar products from other vendors.
Geek Stuff – What was Announced
The Moonshot 1500 is a 4.3U enclosure that can contain up to 45 plug-in server cartridges, each one a complete server node with a dual-core Intel Atom 1200 CPU, up to 8 GB of memory and a single disk or SSD device, up to 1 TB, and the servers share common power supplies and cooling. But beyond the density, the real attraction of the MS1500 is its scalable fabric and CPU-agnostic architecture. Embedded in the chassis are multiple fabrics for storage, management and network giving the MS1500 (my acronym, not an official HP label) some of the advantages of a blade server without the advanced management capabilities. At initial shipment, only the network and management fabric will be enabled by the system firmware, with each chassis having up two Gb Ethernet switches (technically they can be configured with one, but nobody will do so), allowing the 45 servers to share uplinks to the enterprise network.
With a couple of months' perspective, I’m pretty convinced that Intel has made a potentially disruptive entry in the market for programmable computational accelerators, often referred to as GPGPUs (General Purpose Graphics Processing Units) in deference to the fact that the market leaders, NVIDIA and AMD, have dominated the segment with parallel computational units derived from high-end GPUs. In late 2012, Intel, referring to the architecture as MIC (Many Independent Cores) introduced the Xeon Phi product, the long-awaited productization of the development project that was known internally (and to the rest of the world as well) as Knight’s Ferry, a MIC coprocessor with up to 62 modified Xeon cores implemented in its latest 22 nm process.
When I returned to Forrester in mid-2010, one of the first blog posts I wrote was about Oracle’s new roadmap for SPARC and Solaris, catalyzed by numerous client inquiries and other interactions in which Oracle’s real level of commitment to future SPARC hardware was the topic of discussion. In most cases I could describe the customer mood as skeptical at best, and panicked and committed to migration off of SPARC and Solaris at worst. Nonetheless, after some time spent with Oracle management, I expressed my improved confidence in the new hardware team that Oracle had assembled and their new roadmap for SPARC processors after the successive debacles of the UltraSPARC-5 and Rock processors under Sun’s stewardship.
Two and a half years later, it is obvious that Oracle has delivered on its commitments regarding SPARC and is continuing its investments in SPARC CPU and system design as well as its Solaris OS technology. The latest evolution of SPARC technology, the SPARC T5 and the soon-to-be-announced M5, continue the evolution and design practices set forth by Oracle’s Rick Hetherington in 2010 — incremental evolution of a common set of SPARC cores, differentiation by variation of core count, threads and cache as opposed to fundamental architecture, and a reliable multi-year performance progression of cores and system scalability.
HP seems to be on a tear, bouncing from litigation with one of its historically strongest partners to multiple CEOs in the last few years, continued layoffs, and a recent massive write-down of its EDS purchase. And, as we learned last week, the circus has not left town. The latest “oops” is an $8.8 billion write-down for its purchase of Autonomy, under the brief and ill-fated leadership of Léo Apotheker, combined with allegations of serious fraud on the part of Autonomy during the acquisition process.
The eventual outcome of this latest fiasco will be fun to watch, with many interesting sideshows along the way, including:
Whose fault is it? Can they blame it on Léo, or will it spill over onto Meg Whitman, who was on the board and approved it?
Was there really fraud involved?
If so, how did HP miss it? What about all the internal and external people involved in due diligence of this acquisition? I’ve been on the inside of attempted acquisitions at HP, and there were always many more people around with the power to say “no” than there were people who were trying to move the company forward with innovative acquisitions, and the most persistent and compulsive of the group were the various finance groups involved. It’s really hard to see how they could have missed a little $5 billion discrepancy in revenues, but that’s just my opinion — I was usually the one trying to get around the finance guys. :)
On Tuesday November 8, after more than a year of pre-announcement disclosures that eventually left very little to the imagination, Intel finally announced the Itanium 9500, formerly known as Poulson. Added to this was the big surprise of HP announcing a refresh of its current line of Integrity servers, from blades to the large Superdome servers, with the new Itanium 9500.
As noted in an earlier post, the Itanium 9500 offers considerable performance improvements over its predecessors, and instantiated in HP’s new Integrity line it is positioned as delivering between 2X and 3X the performance per socket as previous Itanium 9300 (Tukwilla) systems at approximately the same price. For those remaining committed to Itanium and its attendant OS platforms, notably HP-UX, this is unmitigated good news. The fly in the ointment (I have never seen a fly in any ointment, but it does sound gross), of course, is HP’s dispute with Oracle. Despite the initial judgment in HP’s favor, the trial is a) not over yet, and b) Oracle has already filed for an early appeal of the initial verdict, which would ordinarily have to wait until the second phase of the trial, scheduled for next year, to finish. The net takeaway is that Oracle’s future availability on Itanium and HP-UX is not yet assured, so we really cannot advise the large number of Oracle users who will require Oracle 12 and later versions to relax yet.
Earlier this week, in conjunction with ARM Holdings plc’s announcement of the upcoming Cortex A53 and A57, full 64-bit CPU implementations based on the ARM V8 specification, AMD also announced that it would be designing and selling SOC (System On a Chip) products based on this technology in 2014, roughly coinciding with availability of 64-bit parts from ARM and other partners.
This is a major event in the ARM ecosystem. AMD, while much smaller than Intel, is still a multi-billion-dollar enterprise, and for the second largest vendor of x86 chips to also throw its hat into the ARM ecosystem and potentially compete with its own mainstream server and desktop CPU business is an aggressive move on the part of AMD management that carries some risk and much potential advantage.
Reduced to its essentials, what AMD announced (and in some cases hinted at):
Intention to produce A53/A57 SOC modules for multiple server segments. There was no formal statement of intentions regarding tablet/mobile devices, but it doesn’t take a rocket scientist to figure out that AMD wants a piece of this market, and ARM is a way to participate.
The announcement is wider that just the SOC silicon. AMD also hinted at making a range of IP, including its fabric architecture from the SeaMicro architecture, available in the form of “reusable IP blocks.” My interpretation is that it intends to make the fabric, reference architectures, and various SOCs available to its hardware system partners.
Nathan Bedford Forrest, a Confederate general of despicable ideology and consummate tactics, spoke of “keepin up the skeer,” applying continued pressure to opponents to prevent them from regrouping and counterattacking. POWER7+, the most recent version of IBM’s POWER architecture, anticipated as a follow-up to the POWER7 for almost a year, was finally announced this week, and appears to be “keepin up the skeer” in terms of its competitive potential for IBM POWER-based systems. In short, it is a hot piece of technology that will keep existing IBM users happy and should help IBM maintain its impressive momentum in the Unix systems segment.
For the chip heads, the CPU is implemented in a 32 NM process, the same as Intel’s upcoming Poulson, and embodies some interesting evolutions in high-end chip design, including:
Use of DRAM instead of SRAM — IBM has pioneered the use of embedded DRAM (eDRAM) as embedded L3 cache instead of the more standard and faster SRAM. In exchange for the loss of speed, eDRAM requires fewer transistors and lower power, allowing IBM to pack a total of 80 MB (a lot) of shared L3 cache, far more than any other product has ever sported.