The Need For Speed – GPUs Emerge As Mainstream Accelerators

It’s probably fair to say that the computer community is obsessed with speed. After all, our people buy computers to solve problems, and generally the faster the computer, the faster the problem gets solved. The earliest benchmark that I have seen was published in “High-Speed Computing Devices,” Engineering Research Associates, McGraw-Hill, 1950. The authors cite the Marchant desktop calculator as achieving a best-in-class result of 1,350 digits per minute for addition, and the threshold problems then were figuring out how to break down Newton-Raphson equation solvers for maximum computational efficiency. And so the race begins…

Not much has changed since 1950. While our appetites are now expressed in GFLOPS per CPU and TFLOPS per system, users continue to push for escalation of performance in numerically intensive problems. Just as we settled down to a relatively predictable performance model with standard CPUs and cores glued into servers and aggregated into distributed computing architectures of various flavors, along came the notion of attached processors. First appearing in the 1960s and 1970s as attached mainframe vector processors and attached floating point array processors for minicomputers, attached processors have always had devoted and vocal minority support within the industry. My own brush with them was as a developer using a Floating Point Systems array processor attached to a 32-bit minicomputer to speed up a nuclear reactor core power monitoring application. When all was said and done, the 50X performance advantage of the FPS box had decreased to about 3.5X for the total application. Not bad, but a defeat of expectations. Subsequent brushes with attempts to integrate DSPs with workstations left me a bit jaundiced about the future of attached processors as general-purpose accelerators.
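The drop from 50X on the accelerated portion to about 3.5X for the whole application is the kind of result Amdahl's law describes. The sketch below is only an illustration: it assumes a single accelerated portion and no data transfer overhead, neither of which is stated above, and the function names are made up for this example.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only a fraction
    of the original run time is sped up by kernel_speedup."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


def accelerated_fraction_for(overall, kernel_speedup):
    """Invert Amdahl's law: which fraction of the original run time
    must have been accelerated to observe the given overall speedup?"""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Figures from the text: 50X on the attached processor, 3.5X overall.
    p = accelerated_fraction_for(3.5, 50.0)
    print(f"implied accelerated fraction: {p:.3f}")
    print(f"check, overall speedup: {overall_speedup(p, 50.0):.2f}")
```

Under those assumptions, roughly 73% of the original run time would have been offloaded, and the untouched remainder is what capped the overall gain.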

Fast forward to today. After a discussion with my colleague James Staten and a conversation with a large client in the petrochemical industry, I decided to do some digging into the use of GPUs as application accelerators. GPUs were originally designed to apply large numbers of parallel cores to an assortment of graphics rendering operations, but within the last decade developers began to look at applying these devices to general-purpose computational problems as well. The last three years have seen significant and, even in the context of an industry where superlatives get worn out over time, startling progress. Major developments include:

  • Multiple generations of devices now optimized for use as computational accelerators, providing hundreds of cores with full floating point computation capability.
  • Architectures that remove many of the data transfer bottlenecks that reduced the effectiveness of earlier designs.
  • Most importantly, programming languages and development tools that bring effective use of GPUs within reach of moderately skilled programmers who understand the algorithms they wish to solve, as opposed to requiring very skilled specialists in parallel computational algorithms.
  • All of this is reinforced by a growing body of user references across multiple industries and applications.

These changes have resulted in increased support from both the ISV and the systems hardware communities, with selected ISVs now supporting GPU acceleration and mainstream server vendors IBM and Dell (can HP be far behind?) as well as a roster of smaller specialized vendors such as Bull, Cray, Tyan, Appro and SuperMicro offering servers with pre-integrated GPUs. The current market leader is NVIDIA, with its Tesla line of accelerators. AMD, which purchased NVIDIA competitor ATI, has announced its Fusion integrated CPU/GPU architecture, which it calls an Accelerated Processing Unit (APU).

User testimonials abound, with quotes of 50 – 150X performance improvements in important applications. Note that these results aren’t instant or free. Expect a GPU-based solution to take 10 – 20X the time needed to develop the same algorithm in a simple HLL, and expect to refine it incrementally, since simulation and analysis tools for these applications and systems are in their infancy.

So, while more and faster cores will continue to be an option, and one that Intel appears to be pursuing aggressively, GPUs present another option in the eternal quest for speed that has consumed our collective consciousness since some nameless craftsman sat down and polished the rods on his abacus to make the beads slide a bit faster.

So please tell us. Are you planning to use or are you interested in using GPU technology? What kind of applications? What kind of platform?


HPC and visual search

A few years back I looked at GPU for financial services HPC applications (along with other approaches like Cell, ClearSpeed, FPGA etc.). There's no denying that there was plenty of performance on the table, but the developer productivity costs always pushed us back to commodity x86 (when your developers are specialist quant guys who cost north of $250k/yr the balance between system productivity and developer productivity becomes quite one sided).

More recently I've seen GPU being used for visual search. Check out

Agree, but ...

I generally agree (see comment/response below) but for high value apps they can be pretty amazing.

One of the major changes over the last two years has been improvement of the dev environment and growth in the ecosystem, all of which will help to push the dev cost down from "astronomical" to merely "very high". Even at $250K/year, there are still some apps that are absolute no-brainers.

Simple rule of thumb - if the value of investing 10 - 20x in the core solver for your problem doesn't jump out at you, it's probably not worth doing :)

Most of us

"Most of us don't need something like this." I suspect many would say that. The reality is "Most of us can't USE something like this." The legacy of the IT industry is that parallel processing was for only a specialized few with expensive skills. Now look at the number of people buying multi-core CPUs from Intel et al. and then finding out that 2 x 2.1 GHz is NOT as fast as a single 3.0 GHz processor, except, again, in the hands of a specialized few.

All of these makers - Intel especially - need to be working with applications vendors to actually be able to use more than one processor at a time. Most CPUs out there in the wild are not powering servers - they are sitting in front of people who are disappointed that their fancy new x-core PC is slower than molasses.

For example, I develop web sites. Eventually, these will be on servers, but not until I am through developing. Pretty much all content management systems are actually single-threaded and don't make use of multiple CPUs - at least not for a single session (as in "people"). They only make use of more than one processor when multiple sessions are open. That's where my problem comes into this. I am a single processor (brain) developer. I am far better off with a single 3.0 GHz (or better) machine than a dual-core 2.1 GHz machine. When software vendors begin developing applications that can actually use multiple CPUs, then all those x-core processors will blossom. Until then, the CPU makers should push faster single-core machines just as hard as the multi-core machines.

Until CMSs, or medical billing systems, or any of the multitude of application suites, start being accelerated by 2 cores, let alone hundreds, the market will be limited and disappointing (to vendors and users alike), and prices will, of necessity, remain high. Thomas Edison's philosophy was to make people think they needed something and then give it to them. PC vendors need to do the same, instead of having solutions in search of problems.

"Most of us" is correct ... but the niche payoffs can be huge

I strongly agree - GPUs are not for most of us, and for the kind of applications which you reference in your comment, probably never.

The big wins are applications with inherently parallel algorithms where calculations are carried out in parallel on elements of a large data structure that can be managed in the memory of the GPU. Think image processing types of apps where the image, be it seismic, medical, electronic intelligence or security camera images, is stored on the GPU and a gazillion calculations are done to produce a single consolidated output. Another case is financial risk analysis and pricing, where a multi-GB chunk of historical data is in memory and a bunch of calculations are then done to evaluate risk, price, decision to take a trade, etc. Another characteristic of good apps is that the operation has to be done many times, and that it has a high economic payback.
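The pattern described above, many independent per-element calculations followed by one consolidated output, can be sketched in a few lines. This is a plain Python illustration that runs serially on the CPU, not GPU code; the function names and the brightness operation are hypothetical stand-ins chosen only to show the shape of the work.

```python
def brighten(pixel, gain=1.5):
    """Per-element step: depends on one pixel only, so every pixel
    could in principle be handled by its own GPU thread."""
    return min(255, int(pixel * gain))


def process_image(image, gain=1.5):
    """Apply the per-element step to every pixel of a 2D image
    (a list of rows of integer intensities)."""
    return [[brighten(p, gain) for p in row] for row in image]


def mean_intensity(image):
    """Reduction step: collapse the whole image to a single number."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


if __name__ == "__main__":
    image = [[0, 100], [200, 255]]
    result = process_image(image)
    print(result, mean_intensity(result))
```

No pixel's result depends on any other pixel, which is what makes this class of problem a candidate for hundreds of cores working at once; the reduction at the end is the only step that has to combine results.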

For these apps, the returns can be spectacular and worth the major (factor of 10 - 20) extra development time. My favorite poster child is a medical app at a major university hospital for using low-dose imaging to guide radiology procedures. The previous version was a batch operation, on the order of 1 hr, useful for planning but not for real-time use during procedures. The GPU version delivered approximately 150x the performance, making it feasible to use as a real-time aid during the procedure, lowering patient x-ray dosage.

Take a look at this site...

"kind of applications which you reference in your comment, probably never"

On the right hand side you see a series of blocks that are regenerated on every page request. On the front page, you see, for every post, a comment count and other stuff - again, generated for every page request. Most of that comes from database queries, some of which are rather complex. While each of those items is being built, nothing else is happening to give you what you are mostly interested in - displaying the content. And most of them really have little to nothing to do with the content, or even each other.

If all that could be farmed out to separate processes on separate CPUs, the whole page would render much faster and you would get what you're looking for much faster. Now, couple that with search engines ranking by delivery speed, and there is value in multi-tasking/multi-computing on even the most mundane of chores.
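A rough sketch of that idea, building independent page blocks concurrently instead of one after another, might look like the following. Everything here is hypothetical: the block names, the `render_block` function, and the short sleep standing in for a database query. It also uses a thread pool rather than separate processes, which is a simplification that suits work that mostly waits on a database; CPU-heavy blocks would need processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sidebar blocks that do not depend on each other.
BLOCKS = ["recent_comments", "tag_cloud", "popular_posts", "archives"]


def render_block(name):
    """Stand-in for building one block; the sleep simulates
    waiting on a database query."""
    time.sleep(0.05)
    return f"<div class='{name}'>...</div>"


def render_sequential(blocks):
    """Build each block in turn, as a single-threaded CMS would."""
    return [render_block(b) for b in blocks]


def render_concurrent(blocks):
    """Build all blocks at once; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
        return list(pool.map(render_block, blocks))


if __name__ == "__main__":
    for render in (render_sequential, render_concurrent):
        start = time.perf_counter()
        render(BLOCKS)
        print(render.__name__, f"{time.perf_counter() - start:.2f}s")
```

Both functions return the same HTML fragments in the same order; the concurrent version simply overlaps the waiting, so its wall-clock time should approach that of the slowest single block rather than the sum of all of them.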

As a physics and computer science graduate, I fully understand the value of numerically intensive computing. My senior paper was on ion channeling in silicon crystals (this was before they started creating semiconductors that way). Analyzing the data by hand took a week; after I wrote programs to do it on the mainframe, we could do it in under a day. Had the technology of today been available, the data analysis could have been spit right out of the data collection apparatus on the spot.