Calxeda, one of the most visible stealth mode startups in the industry, has finally given us an initial peek at the first iteration of its server plans, and they both meet our inflated expectations for this ARM server startup and validate some of the initial claims of ARM proponents.
While still holding their actual delivery dates and details of specifications close to their vest, Calxeda did reveal the following cards from their hand:
The first reference design, which will be provided to OEM partners as well as delivered directly to selected end users and developers, will be based on an ARM Cortex A9 quad-core SOC design.
The SOC, as Calxeda will demonstrate with one of its reference designs, will enable OEMs to design servers as dense as 120 ARM quad-core nodes (480 cores) in a 2U enclosure, with an average consumption of about 5 watts per node (1.25 watts per core) including DRAM.
Although Calxeda was not forthcoming with details about performance, topology, or protocols, the SOC will contain an embedded fabric that lets the individual quad-core SOC servers communicate with each other.
Most significantly for prospective users, Calxeda is claiming, and has some convincing models to back up these claims, that it will provide a 5X to 10X advantage in performance/watt (and an even higher one when price is factored in, for a metric of performance/watt/$) over any products it expects to see when it brings the product to market.
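The density figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers Calxeda disclosed (120 nodes per 2U, four cores per node, about 5 watts per node including DRAM). The function and field names are illustrative, and the enclosure total covers node power only: it ignores fans, power-supply losses, and anything else Calxeda has not detailed.

```python
def enclosure_totals(nodes, cores_per_node, watts_per_node):
    """Return core count, summed node power, and watts per core for one enclosure."""
    total_cores = nodes * cores_per_node
    total_watts = nodes * watts_per_node          # node power only (assumption)
    watts_per_core = watts_per_node / cores_per_node
    return {
        "cores": total_cores,
        "watts": total_watts,
        "watts_per_core": watts_per_core,
    }


# Figures as disclosed for the 2U reference design.
calxeda_2u = enclosure_totals(nodes=120, cores_per_node=4, watts_per_node=5.0)
print(calxeda_2u)
```

On these inputs the 480 cores and 1.25 watts per core match the quoted figures, and the nodes alone would draw roughly 600 watts per 2U enclosure.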
Intel, despite a popular tendency to associate a dominant market position with indifference to competitive threats, has not been sitting still waiting for the ARM server phenomenon to engulf them in a wave of ultra-low-power servers. Intel is fiercely competitive, and it would be silly for any new entrants to assume that Intel will ignore a threat to the heart of a high-growth segment.
In 2009, Intel released a microserver specification for compact low-power servers, and along with competitor AMD, it has been aggressive in driving down the power envelope of its mainstream multicore x86 server products. Recent momentum behind ARM-based servers has heated this potential competition up, however, and Intel has taken the fight deeper into the low-power realm with the recent introduction of the N570, an existing embedded low-power processor, as a server CPU aimed squarely at emerging ultra-low-power and dense servers. The N570, a dual-core Atom processor, is currently being used by a single server partner, ultra-dense server manufacturer SeaMicro (see Little Servers For Big Applications At Intel Developer Forum), and will allow them to deliver their current 512 Atom cores with half the number of CPU components and some power savings.
Technically, the N570 is a dual-core Atom CPU with 64 bit arithmetic, a differentiator against ARM, and the same 32-bit (4 GB) physical memory limitations as current ARM designs, and it should have a power dissipation of between 8 and 10 watts.
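Two of the figures above reduce to simple arithmetic: the 4 GB ceiling is the size of a 32-bit physical address space, and halving the CPU component count follows from moving 512 cores onto dual-core parts. A minimal sketch follows. The helper names are hypothetical, and the assumption that the earlier SeaMicro design used one core per CPU is inferred from the "half the number of CPU components" remark rather than stated outright.

```python
def addressable_bytes(address_bits):
    """Bytes reachable with a physical address of the given width."""
    return 2 ** address_bits


def cpus_needed(total_cores, cores_per_cpu):
    """CPU packages required to supply total_cores (rounded up)."""
    return -(-total_cores // cores_per_cpu)


limit_gib = addressable_bytes(32) / 2**30   # 32-bit physical addressing
single_core_cpus = cpus_needed(512, 1)      # assumed earlier layout
dual_core_cpus = cpus_needed(512, 2)        # dual-core N570 layout
print(limit_gib, single_core_cpus, dual_core_cpus)
```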
Since introducing its Core 2 architecture, Intel has reversed much of the damage done to it by AMD in the server space, with attendant publicity. AMD, however, has been quietly reclaiming some ground with its 12-core 6100 series CPUs, showing strength in benchmarks that emphasize high throughput in process-rich environments as opposed to maximum performance per core. Several AMD-based system products have also been cited by their manufacturers to us as enjoying very strong customer acceptance due to the throughput of the 12-core CPUs combined with their attractive pricing. As a fillip to this success, AMD this past week announced speed bumps for the 6100-series products to give a slight performance boost as they continue to compete with Intel’s Xeon 5600 and 7500 products (Intel’s Sandy Bridge server products have not yet been announced).
But the real news last week was the quiet subtext that the anticipated 16-core Interlagos products based on the new Bulldozer core appear to be on schedule for Q2 ’11 shipments to system partners, who should probably be able to ship systems during Q3, and that AMD is still certifying them as compatible with the current sockets used for the 12-core 6100 CPUs. This implies that system partners will be able to deliver products based on the new parts very rapidly.
Actual performance of these systems will obviously be dependent on the workloads being run, but our gut feeling is that while they will not rival the per-core performance of the Intel Xeon 7500 CPUs, for large throughput-oriented environments with high numbers of processes, a description that fits a large number of web and middleware environments, these CPUs, each with up to a 50% performance advantage per core over the current AMD CPUs, may deliver some impressive benchmarks and keep the competition in the server space at a boil, which in the end is always helpful to customers.
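To put a rough ceiling on that gut feeling, the sketch below combines the move from 12 to 16 cores with the "up to 50%" per-core figure. It assumes throughput scales perfectly with core count, which real workloads rarely achieve, so the result is a best-case bound rather than a prediction. The function name and parameters are illustrative.

```python
def relative_socket_throughput(old_cores, new_cores, per_core_gain):
    """Best-case throughput of the new socket relative to the old one.

    Assumes perfectly linear scaling across cores (an idealization).
    per_core_gain is fractional, e.g. 0.5 for a 50% per-core improvement.
    """
    return (new_cores * (1.0 + per_core_gain)) / old_cores


best_case = relative_socket_throughput(12, 16, 0.5)   # full 50% per-core gain
cores_only = relative_socket_throughput(12, 16, 0.0)  # no per-core gain at all
print(best_case, round(cores_only, 2))
```

Under these assumptions a 16-core part lands somewhere between about 1.33 and 2 times the throughput of a current 12-core socket, depending on how much of the per-core gain a given workload actually sees.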
One evening in 1972 I was hanging out in the computer science department at UC Berkeley with a couple of equally socially backward friends waiting for our batch programs to run, and to kill some time we dropped in on a nearby physics lab that was analyzing photographs of particle tracks from one of the various accelerators that littered the Lawrence Radiation Laboratory. Analyzing these tracks was real scut work – the overworked grad student had to measure angles between tracks, length of tracks, and apply a number of calculations to them to determine if they were of interest. To our surprise, this lab had something we had never seen before – a computer-assisted screening device that scanned the photos and in a matter of seconds determined whether they had any formations that were of interest. It had a big light table, a fancy scanner, whirring arms and levers and gears, and off in the corner, the computer, “a PDP from Digital Equipment.” It was a 19” rack mount box with an impressive array of lights and switches on the front. As a programmer of the immense 1 MFLOP CDC 6400 in the Rad Lab computer center, I was properly dismissive…
This was a snapshot of the dawn of the personal computer era, almost a decade before IBM introduced the PC and blew it wide open. The PDP (Programmed Data Processor) systems from Ken Olsen, the MIT-trained engineer who co-founded Digital Equipment, were the beginning of the fundamental change in the relationship between man and computer, putting a person in the computing loop instead of keeping them standing outside the temple.
Last week IBM and ARM Holdings Plc quietly announced a continuation of their collaboration on advanced process technology, this time with a stated goal of developing ARM IP optimized for IBM physical processes down to a future 14 nm size. The two companies have been collaborating on semiconductors and SOC design since 2007, and this extension has several important ramifications for both companies and their competitors.
It is a clear indication that IBM retains a major interest in low-power and mobile computing, despite its previous divestment of its desktop and laptop computers to Lenovo, and that it will be in a position to harvest this technology, particularly ARM's modular approach to composing SOC systems, for future productization.
For ARM, the implications are clear. Its latest announced product, the Cortex A15, which will probably appear in system-level products in approximately 2013, will be initially produced in 32 nm with a roadmap to 20 nm. The existence of a roadmap to a potential 14 nm product serves notice that the new ARM architecture will have a process roadmap that will keep it on Intel’s heels for another decade. ARM has parallel alliances with TSMC and Samsung as well, and there is no reason to think that these will not be extended, but the IBM alliance is an additional insurance policy. As well as a source of semiconductor technology, IBM has a deep well of systems and CPU IP that certainly cannot hurt ARM.
From nothing more than an outlandish speculation, the prospects for a new entrant into the volume Linux and Windows server space have suddenly become much more concrete, culminating in an immense buzz at CES as numerous players, including NVIDIA and Microsoft, stoked the fires with innuendo, announcements, and demos.
Consumers of x86 servers are always on the lookout for faster, cheaper, and more power-efficient servers. In the event that they can’t get all three, the combination of cheaper and more energy-efficient seems to be attractive to a large enough chunk of the market to have motivated Intel, AMD, and all their system partners to develop low-power chips and servers designed for high density compute and web/cloud environments. Up until now the debate was Intel versus AMD, and low power meant a CPU with four cores and a power dissipation of 35 – 65 Watts.
The Promised Land
The performance trajectory of processors that were formerly purely mobile device processors, notably the ARM Cortex, has suddenly introduced a new potential option into the collective industry mindset. But is this even a reasonable proposition, and if so, what does it take for it to become a reality?
Our first item of business is to figure out whether or not it even makes sense to think about these CPUs as server processors. My quick take is yes, with some caveats. The latest ARM offering is the Cortex A9, with vendors offering dual core products at up to 1.2 GHz currently (the architecture claims scalability to four cores and 2 GHz). It draws approximately 2W, much less than any single core x86 CPU, and a multi-core version should be able to execute any reasonable web workload. Coupled with the promise of embedded GPUs, the notion of a server that consumes much less power than even the lowest power x86 begins to look attractive. But…
Intel today officially announced the first products based on the much-discussed Sandy Bridge CPU architecture, and first impressions are highly favorable, with my take being that Sandy Bridge represents the first step in a very aggressive product road map for Intel in 2011.
Sandy Bridge is the next architectural spin (the “tock”) after Westmere, Intel’s process shrink (the “tick”) of the predecessor Nehalem architecture, in Intel’s famous “tick-tock” progression of process shrinks alternating with architectural changes, and incorporates some major innovations compared to the previous architecture:
Minor but in toto significant changes to many aspects of the low-level microarchitecture – more registers, better prefetch, changes to the way instructions and operands are decoded, cached and written back to registers and cache.
Major changes in integration of functions on the CPU die – Almost all major subsystems, including CPU, memory controller, graphics controller and PCIe controller, are now integrated onto the same die, along with the ability to share data with much lower latency than in previous generations. In addition to more efficient data sharing, this level of integration allows for better power efficiency.
Improvements to media processing – A dedicated video transcoding engine and an extended vector instruction set for media and floating point calculations improve Sandy Bridge capabilities in several major application domains.
I’ve recently had the opportunity to talk with a small sample of SLES 11 and RH 6 Linux users, all developing their own applications. All were long-time Linux users, and two of them, one in travel services and one in financial services, had applications that can be described as both large and mission-critical.
The overall message is encouraging for Linux advocates, both the calm rational type as well as those who approach it with near-religious fervor. The latest releases from SUSE and Red Hat, both based on the 2.6.32 Linux kernel, show significant improvements in scalability and modest improvements in iso-configuration performance. One user reported that an application that previously had maxed out at 24 cores with SLES 10 was now nearing production certification with 48 cores under SLES 11. Performance scalability was reported as “not linear, but worth doing the upgrade.”
Overall memory scalability under Linux is still a question mark, since the widely available x86 platforms do not exceed 3 TB of memory, but initial reports from a user familiar with HP’s DL 980 verify that the new Linux kernel can reliably manage at least 2 TB of RAM under heavy load.
File system options continue to expand as well. The older Linux FS standard, ext4, which can scale to “only” 16 TB, has been joined by additional options such as XFS (contributed by SGI), which has been implemented in several installations with file systems in excess of 100 TB, relieving a limitation that may have been more psychological than practical for most users.
I just spent some time talking to ScaleMP, an interesting niche player that provides a server virtualization solution. What is interesting about ScaleMP is that rather than splitting a single physical server into multiple VMs, they are the only successful offering (to the best of my knowledge) that allows I&O groups to scale up a collection of smaller servers to work as a larger SMP.
Others have tried and failed to deliver this kind of solution, but ScaleMP seems to have actually succeeded, with a claimed 200 customers and expectations of somewhere between 250 and 300 next year.
Their vSMP product comes in two flavors, one that allows a cluster of machines to look like a single system for purposes of management and maintenance while still running as independent cluster nodes, and one that glues the member systems together to appear as a single monolithic SMP.
Does it work? I haven’t been able to verify their claims with actual customers, but they have been selling for about five years and claim over 200 accounts, with a couple of dozen publicly referenced. All in all, probably too elaborate a front to maintain if there were really nothing there. The background of the principals and the technical details they were willing to share convinced me that they have a deep understanding of the low-level memory management, prefetching, and caching that would be needed to make a collection of systems function effectively as a single system image. Their smaller scale benchmarks displayed good scalability in the range of 4 – 8 systems, well short of their theoretical limits.
My quick take is that the software works, and bears investigation if you have an application that:
Either is certified to run with ScaleMP (not many), or is one where you control the code.
You understand the memory reference patterns of the application, and
On Dec. 2, Oracle announced the next move in its program to integrate its hardware and software assets, with the introduction of Oracle Private Cloud Architecture, an integrated infrastructure stack with Infiniband and/or 10G Ethernet fabric, integrated virtualization, management and servers along with software content, both Oracle’s and customer-supplied. Oracle has rolled out the architecture as a general platform for a variety of cloud environments, along with three specific implementations, Exadata, Exalogic and the new Sunrise Supercluster, as proof points for the architecture.
Exadata has been dealt with extensively in other venues, both inside Forrester and externally, and appears to deliver the goods for I&O groups who require efficient consolidation and maximum performance from an Oracle database environment.
Exalogic is a middleware-targeted companion to the Exadata hardware architecture (or another instantiation of Oracle’s private cloud architecture, depending on how you look at it), presenting an integrated infrastructure stack ready to run either Oracle or third-party apps, although Oracle is positioning it as a Java middleware platform. It consists of the following major components integrated into a single rack:
Oracle x86 or T3-based servers and storage.
Oracle Quad-rate Infiniband switches and the Oracle Solaris gateway, which makes the Infiniband network look like an extension of the enterprise 10G Ethernet environment.
Oracle Linux or Solaris.
Oracle Enterprise Manager Ops Center for management.