SIMD | Jan Can Blog

Google Scholar reports 303,000 hits for ‘GPU’! Are we really seeing enough benefits to justify the precious R&D time sacrificed on that altar?

Taking a step back, the mainstream architectures include FPGAs (programmable hardware), GPUs (simplified and specialized co-processors) and general-purpose CPUs with SIMD. It is important to choose wisely because switching between them is expensive.

FPGAs are probably the fastest option for integer operations. Algorithms can be cast into specially designed hardware that provides exactly the operations required. Against the objection that the FPGA structure is less efficient than truly hard-wired circuits or signal processors, we note that some FPGAs actually do contain fixed-function signal-processing hardware or even full CPUs. However, despite the existence of C-to-Hardware-Design-Language translators/compilers, development is still slow. Implementing and debugging a new feature involves a comparatively lengthy synthesis process, and the tools remain sub-optimal. For R&D work, flexibility and rapid prototyping matter more, and casting algorithms into hardware diverts development resources. Fast develop/debug/test cycles and automated self-tests are more helpful. As a consequence, an initial software version should be developed anyway.

The more difficult question is whether to grow that into a GPU solution or remain on the CPU. The boundaries are somewhat blurred as both architectures continue to converge. GPUs are nearly full-featured processors and have begun integrating caches, whereas CPUs are moving closer to GPU-scale parallelism (SIMD, multicore, hyperthreading).

However, one point in favor of current CPUs is that they come with far greater amounts of memory. Some algorithms require multiple gigabytes, which exceeds the capacity of most GPUs (though what memory they do have is very fast). Although AMD and Nvidia have recently introduced high-end GPUs with more memory, these are niche offerings. Deployment will not come cheap (about $3000) and is likely to involve vendor lock-in.

These well-known points aside, the more interesting topic concerns the programming model of GPUs vs CPUs. Although the underlying hardware is increasingly similar, GPUs expose a simplified “SIMT” model that hides the ‘ugly’ details of SIMD. Programs are written to process a single element at a time and the compiler and hardware magically coalesce these into vector operations. By contrast, CPUs require programmers to deal with and be aware of the vector size. Thus, the argument goes, GPUs are “easier” to program.
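To make the contrast concrete, here is a minimal sketch in portable C++ (no intrinsics). `scale_add_element` is the per-element form in which a SIMT kernel body would be written; `scale_add_chunked` has the shape of explicit SIMD code, where the programmer picks a lane count and handles the leftover tail. The function names and the `kLanes` value are illustrative assumptions, not taken from any particular API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SIMT style: the body describes a single element. On a GPU, the compiler
// and hardware decide how elements are grouped into vector operations.
inline float scale_add_element(float a, float x, float y) {
    return a * x + y;
}

// Hypothetical lane count: four 32-bit floats fit in a 128-bit register.
constexpr std::size_t kLanes = 4;

// SIMD style: the loop advances in units of kLanes elements, and the
// programmer must deal with the remainder explicitly.
void scale_add_chunked(float a, const std::vector<float>& x,
                       const std::vector<float>& y, std::vector<float>& out) {
    assert(x.size() == y.size() && out.size() == x.size());
    const std::size_t n = x.size();
    const std::size_t full = n - n % kLanes;  // largest multiple of kLanes
    std::size_t i = 0;
    for (; i < full; i += kLanes) {
        // One "vector" iteration: kLanes independent lanes.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
    for (; i < n; ++i) {  // scalar tail for the leftover elements
        out[i] = scale_add_element(a, x[i], y[i]);
    }
}
```

The second function is longer, but the extra lines are exactly the "awareness of the vector size" that the SIMT model hides.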

The Emperor’s New Clothes: Performance portability

At conferences since 2008, I have seen speedups of 10-100x proudly reported. However, the amount of ‘tuning’ the code to a particular GPU is quite astonishing to this veteran of assembly language programming. Minutiae such as memory bank organization, shared-multiprocessor occupancy, register vs. shared memory, and work group size are enthusiastically discussed, even though they change every six months.

“Performance portability” (the hope that a program will run similarly fast on a different GPU) remains an open research problem, with some disappointing results so far:

“portable performance of three OpenCL programs is poor, generally achieving a low percentage of peak performance (7.5%–40% of peak GFLOPS and 1.4%-40.8% of peak bandwidth)” [Improving Performance Portability in OpenCL Programs]

There is some hope that auto-tuning (automatically adapting certain parameters to the GPU at runtime and seeing what works best) can help. In a way, this is already an admission of defeat because the entire approach is based on tweaking values and merely observing what happens. However, for non-toy problems it becomes difficult to expose parameters beyond trivial ones such as thread count:

“vulnerability of auto-tuning: as optimizations are architecture-specific, they may be easily overseen or considered ‘not relevant’ when developing on different architectures. As such, performance of a kernel must be verified on all potential target architectures. We conclude that, while being technically possible, obtaining performance portability remains time-consuming.” [An Experimental Study on Performance Portability of OpenCL Kernels]
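The "tweak and observe" nature of auto-tuning is easy to see in code. The following is a minimal CPU-side sketch, not a real GPU tuner: `blocked_sum` is a hypothetical stand-in for a kernel with one tunable parameter, and `pick_block_size` simply times each candidate and keeps the fastest. Both names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical tunable kernel: sums `data` in blocks of `block` elements.
double blocked_sum(const std::vector<double>& data, std::size_t block) {
    assert(block > 0);
    double total = 0.0;
    for (std::size_t start = 0; start < data.size(); start += block) {
        const std::size_t end = std::min(start + block, data.size());
        double partial = 0.0;
        for (std::size_t i = start; i < end; ++i) partial += data[i];
        total += partial;
    }
    return total;
}

// Empirical auto-tuning: run every candidate once and keep the fastest.
// Nothing here models *why* one value wins; it only observes the outcome.
std::size_t pick_block_size(const std::vector<double>& data,
                            const std::vector<std::size_t>& candidates) {
    assert(!candidates.empty());
    using Clock = std::chrono::steady_clock;
    std::size_t best = candidates.front();
    Clock::duration best_time = Clock::duration::max();
    volatile double sink = 0.0;  // keeps the compiler from discarding the work
    for (std::size_t block : candidates) {
        const Clock::time_point t0 = Clock::now();
        sink = blocked_sum(data, block);
        const Clock::duration elapsed = Clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = block;
        }
    }
    (void)sink;
    return best;
}
```

The winner is only valid for the machine, input size and system load at the moment of measurement, which is why the result has to be re-verified on every target architecture.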

Chained to the treadmill

This sounds quite unproductive. Rather than develop better-quality algorithms, we have chained ourselves to the treadmill of tweaking the code for every new generation of GPU. By contrast, SIMD code written for the SSSE3 instruction set of 2006 remains just as valid eight years later. Some instructions are now slightly faster, but I am not aware of any regressions. When 64-bit reached the mainstream, a recompile gave access to more registers. When AVX was introduced, an optional recompile gave access to its more efficient instruction encodings. Even the much-maligned dependency on SIMD width often only requires changing a constant (integrated into vector class wrappers around the SIMD intrinsics).
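As an illustration of the "changing a constant" point, here is a minimal sketch of such a wrapper, assuming an x86 target with SSE2 (a portable fallback is included so the sketch compiles elsewhere). The `Vec` type and `saxpy` function are hypothetical names; only the `_mm_*` intrinsics are real.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE/SSE2 intrinsics

// Thin wrapper: the rest of the code sees only Vec and Vec::kLanes.
struct Vec {
    static constexpr std::size_t kLanes = 4;  // 128 bits / 32-bit float
    __m128 v;
    static Vec load(const float* p) { return {_mm_loadu_ps(p)}; }
    static Vec broadcast(float s) { return {_mm_set1_ps(s)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend Vec operator+(Vec a, Vec b) { return {_mm_add_ps(a.v, b.v)}; }
    friend Vec operator*(Vec a, Vec b) { return {_mm_mul_ps(a.v, b.v)}; }
};
#else
// Portable fallback with the same interface, for non-x86 targets.
struct Vec {
    static constexpr std::size_t kLanes = 4;
    float v[kLanes];
    static Vec load(const float* p) {
        Vec r;
        for (std::size_t i = 0; i < kLanes; ++i) r.v[i] = p[i];
        return r;
    }
    static Vec broadcast(float s) {
        Vec r;
        for (std::size_t i = 0; i < kLanes; ++i) r.v[i] = s;
        return r;
    }
    void store(float* p) const {
        for (std::size_t i = 0; i < kLanes; ++i) p[i] = v[i];
    }
    friend Vec operator+(Vec a, const Vec& b) {
        for (std::size_t i = 0; i < kLanes; ++i) a.v[i] += b.v[i];
        return a;
    }
    friend Vec operator*(Vec a, const Vec& b) {
        for (std::size_t i = 0; i < kLanes; ++i) a.v[i] *= b.v[i];
        return a;
    }
};
#endif

// out[i] = a * x[i] + y[i]; n must be a multiple of Vec::kLanes.
// This function never mentions the register width directly.
void saxpy(float a, const float* x, const float* y, float* out,
           std::size_t n) {
    assert(n % Vec::kLanes == 0);
    const Vec va = Vec::broadcast(a);
    for (std::size_t i = 0; i < n; i += Vec::kLanes) {
        (va * Vec::load(x + i) + Vec::load(y + i)).store(out + i);
    }
}
```

Under these assumptions, moving to 256-bit AVX would mean swapping `__m128` for `__m256`, the `_mm_*` calls for their `_mm256_*` counterparts, and `kLanes` from 4 to 8, all inside the wrapper. `saxpy` itself would not change.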

By contrast, CUDA has seen 16 releases since 2007, with major changes to the underlying hardware, programming model and performance characteristics. With so many toolchain updates and tweaks for new hardware, it would seem that SIMD is actually less troublesome in practice.

More subtle flaws

Moreover, the major ‘convenience’ of SIMT, hiding the underlying hardware’s SIMD width, is actually a hindrance. Although cross-lane operations are often questionable (inner products should generally not be computed horizontally within an SIMD register), they should not be banished from the programming model. For example, a novel SIMD entropy coder reaches 2 GB/s throughput (faster than GPUs) by packing registers horizontally.
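The remark about inner products can be sketched as follows, in portable C++ with a lane array standing in for a SIMD register (an assumption made for readability; `dot_vertical` and `kLanes` are illustrative names). Each lane keeps its own partial sum, so the hot loop contains no cross-lane work, and a single horizontal reduction happens once at the end.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // hypothetical register width in floats

// Inner product with "vertical" accumulation; n must be a multiple of kLanes.
float dot_vertical(const float* a, const float* b, std::size_t n) {
    assert(n % kLanes == 0);
    float partial[kLanes] = {};  // one running sum per lane, zero-initialized
    for (std::size_t i = 0; i < n; i += kLanes) {
        // Lane k only ever touches partial[k]: no cross-lane traffic here.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            partial[lane] += a[i + lane] * b[i + lane];
        }
    }
    // The only horizontal (cross-lane) step, executed once per call.
    float sum = 0.0f;
    for (std::size_t lane = 0; lane < kLanes; ++lane) sum += partial[lane];
    return sum;
}
```

The final loop is the kind of cross-lane operation that is cheap when used sparingly, and that a programming model should still allow for cases such as register packing.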

Another lesser-known limitation of the GPU hardware stems from its graphics pedigree. Most values were 32-bit floats, so the hardware does not provide more lanes/arithmetic units for smaller 8 or 16-bit pixels. Even the memory banks are tailored specifically to 32-bit values; 8-bit writes would encounter more bank conflicts.

Perhaps the biggest flaw with GPGPU is that it is an all-or-nothing proposition. The relatively slow (far slower than CPU memory bandwidth) transfers between host and device over PCIe effectively require ALL of the processing to be done on the GPU, even if an algorithm is not suitable for it (perhaps due to heavy branching or memory use).
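A back-of-the-envelope model shows how quickly transfers eat into a kernel speedup. This is a rough sketch under stated assumptions: the same number of bytes travels in each direction, transfers do not overlap with computation, and all timings and bandwidths are supplied by the caller. `OffloadEstimate` and `estimate_offload` are hypothetical names.

```cpp
#include <cassert>

struct OffloadEstimate {
    double transfer_seconds;   // host-to-device plus device-to-host
    double gpu_total_seconds;  // transfers plus kernel time
    bool worthwhile;           // true if the GPU path beats the CPU path
};

// Assumes equal traffic in both directions and no transfer/compute overlap.
OffloadEstimate estimate_offload(double bytes_each_way,
                                 double pcie_bytes_per_second,
                                 double cpu_seconds,
                                 double gpu_kernel_seconds) {
    assert(pcie_bytes_per_second > 0.0);
    const double transfer = 2.0 * bytes_each_way / pcie_bytes_per_second;
    const double total = transfer + gpu_kernel_seconds;
    return {transfer, total, total < cpu_seconds};
}
```

With made-up figures of 1 GB each way over an 8 GB/s link, `estimate_offload(1e9, 8e9, 0.20, 0.02)` spends 0.25 s on transfers alone, so a kernel that is 10x faster than a 0.2 s CPU routine still loses overall. Offloading a single stage rarely pays off unless its neighbors move to the GPU as well.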

Having a background in hardware architectures and experience in OpenCL as well as SIMD, I find it difficult to understand the continued enthusiasm for GPUs. There is always a trade-off between brevity/elegance and performance. Tuning kernels, or even devising an auto-tuning approach, has real costs in terms of development time. In my experience, the cost of developing, testing, debugging and maintaining a large set of GPU kernels is shockingly high, especially if separate CUDA and OpenCL paths are needed (due to persistently lower OpenCL performance on Nvidia hardware). It is surprising that GPUs are still viewed as a silver bullet, rather than as a tool with strengths and weaknesses. When considering the sum of all investments in development time and hardware cost, the benefits of GPUs begin to fade.

By contrast, it is faster and easier to design and develop a system that begins with software prototypes, and incrementally speeds up time-critical portions via SIMD and task-based parallelization (ideally by merely adding high-level source code annotations such as OpenMP or Cilk). This solution remains flexible and does not require an all-or-nothing investment. In my experience, 10-100x speedups are usually also achievable on the CPU, with much less volatility across hardware generations.
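As a minimal sketch of the annotation approach, assuming a compiler with OpenMP 4.0 or later (for example GCC or Clang with `-fopenmp`), a single pragma asks for both multithreading and vectorization of an otherwise ordinary loop. The `brighten` function is a hypothetical example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scales every pixel by `gain`. The loop body is plain scalar code.
void brighten(std::vector<float>& pixels, float gain) {
    assert(gain >= 0.0f);
    float* p = pixels.data();
    const long n = static_cast<long>(pixels.size());
    // With OpenMP enabled, iterations are divided among threads and each
    // thread's share may be vectorized. Without OpenMP, the pragma is
    // ignored and the loop runs serially with identical results.
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        p[i] *= gain;
    }
}
```

Because the pragma can be ignored, the serial prototype and the accelerated version are the same source file, which keeps the incremental path open.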

For real-time (video editing) systems that really must have more bandwidth than even a dual-CPU system can provide, perhaps GPUs are still the better tradeoff. There is anecdotal evidence that the costs of FPGA development are even higher (though this may change as FPGA tools improve). However, I really hope scientists and engineers can escape the temptation of quick publications from adapting yesterday’s techniques to today’s hot new GPU. There are some hard questions that need asking:

Is the real-world application already viable without ‘flashy’ speedups?
Is the approach limited to toy problem sizes due to GPU memory?
Will it be invalidated by the next hardware generation anyway?
Is the GPU already over-taxed by other parts of the application?
Will sending data over PCIe erode the speed gains?
Does the total development and hardware cost eclipse any benefit to the bottom line?

If so, perhaps other worthy avenues of research can be explored.

Blind faith in GPUs and its cost
