Background — High Performance Attached Processors Handicapped By Architecture
The application of high-performance accelerators, notably GPUs, GPGPUs (APUs in AMD terminology) to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for a growing industry ecosystem. Adding heat to an already active mix, Intel’s Xeon Phi accelerators, the most recent addition to the GPU ecosystem, have the potential to speed adoption even further due to hoped-for synergies generated by the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist in the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
Performance — The latency for memory accesses generally dictated by PCIe latencies, which while much improved over previous generations, are a factor of 100 or more longer than latency from coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.