AMD Quietly Rolls Out hUMA – Potential Game-Changer for Parallel Computing
Background – High Performance Attached Processors, Handicapped by Architecture
The application of high-performance accelerators, notably GPUs and GPGPUs (APUs in AMD terminology), to a variety of computing problems has blossomed over the last decade. The result has been ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for a growing industry ecosystem. The most recent entrant is Intel’s Xeon Phi accelerator family, which has the potential to speed adoption even further through hoped-for synergies with the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist on the market) suffer from a fundamental architectural problem – they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs and the structure of the algorithms:
Performance – The latency for memory accesses is generally dictated by PCIe latencies, which, while much improved over previous generations, are a factor of 100 or more longer than latency from coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers, can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.
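To make the latency-hiding point concrete, here is a minimal sketch of a toy timing model, written in Python. It is an illustration only, not a measurement of any real device: the function names, the double-buffering assumption (the next block is transferred while the current one is computed), and all of the numbers are hypothetical.

```python
def serial_time(n_blocks, latency, transfer, compute):
    """No overlap: every block pays latency + transfer, then is computed."""
    return n_blocks * (latency + transfer + compute)


def overlapped_time(n_blocks, latency, transfer, compute):
    """Toy double-buffered pipeline (an assumed model, not a real API).

    Block i+1 is fetched while block i is being computed. The latency and
    transfer of the first block cannot be hidden; after that, each step
    costs whichever is slower, the next fetch or the current compute.
    """
    if n_blocks == 0:
        return 0
    first_fetch = latency + transfer
    steady_state = (n_blocks - 1) * max(latency + transfer, compute)
    last_compute = compute
    return first_fetch + steady_state + last_compute


if __name__ == "__main__":
    # Hypothetical time units chosen only for illustration.
    args = dict(latency=10, transfer=5, compute=20)
    for n in (1, 10):
        print(n, serial_time(n, **args), overlapped_time(n, **args))
```

With a single block the two functions return the same value, which is the situation described above: overlap helps across a series of transfers, but nothing hides the cost of fetching the initial block.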