CPSC 521 algorithm assignment
The paper I looked at for this assignment is "A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers", IPDPS 2012. The basic idea of this paper is to take advantage of the memory hierarchy of GPUs to solve small linear algebra problems very efficiently. Specifically, the aim is to fit as much data as possible directly into GPU registers.
The authors solve two kinds of linear algebra problems: QR factorization using Householder transformations, and LU decomposition without pivoting. Both are standard building blocks for solving systems of linear equations. They are implemented in two ways: by assigning one problem to each thread, and by assigning one problem to each thread block (a group of threads that can communicate with low latency through shared memory). It is assumed that there is a very large number of independent problems to solve, so they can be scheduled in an essentially embarrassingly parallel manner.
All the code in this paper is written for, and all timings were taken on, a high-end graphics card: the NVIDIA Quadro 6000 (which uses the GF100 architecture). The code is written in CUDA, NVIDIA's platform for general-purpose GPU programming. The hardware has 448 CUDA cores available, all of which are utilized here.
GPU performance model
The paper builds a sophisticated model for the GF100 architecture with three main parameters: the cost of synchronization between threads, the cost of moving data through memory (inverse bandwidth), and the time per floating-point operation (FLOP). The authors perform tests to find the practical maximums for the hardware (when measuring memory bandwidth, they manage to significantly out-perform the built-in cudaMemcpy(), by a factor of 1.28), and they run the same tests on older hardware to show that the measurements match published results.
All of this serves to characterize the hardware, and also to establish how close to optimal the algorithms they develop are.
One problem per thread. The first algorithm, one problem per thread, is very simple (embarrassingly parallel). The entire matrix is copied into one thread's registers, the problem is solved, and the results are copied back. For matrices of size 7x7 or smaller this works perfectly, and based on the memory-bandwidth terms of their model the authors even claim the algorithm is optimal. At size 8x8 the data no longer fits in registers and starts spilling into L1 cache and DRAM. But even up to size 16x16 this setup still out-performs standard hybrid CPU+GPU techniques such as those in the MAGMA library.
One problem per block. The solution to the issues with one problem per thread is to involve more than one thread. By assigning (for example) 256 threads together in a block, single-precision matrices of up to 112x112 fit into the block's combined registers. The threads communicate with each other to solve the linear algebra problem, but because all the data stays in registers, the caches and the DRAM video memory do not need to get involved.
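A quick register-budget calculation shows where these size cutoffs come from. The arithmetic here is mine, under the simplifying assumption that one single-precision matrix element occupies one 32-bit register:

```cpp
// Back-of-the-envelope register budget for holding one n x n
// single-precision matrix entirely in registers, split evenly
// across `threads` cooperating threads. Treating one matrix
// element as one 32-bit register is a simplifying assumption;
// real kernels also need registers for indices and temporaries.
constexpr int floats_per_thread(int n, int threads) {
    return (n * n + threads - 1) / threads;  // ceil(n^2 / threads)
}
```

With one problem per thread, a 7x7 matrix needs 49 registers, just under the 63-registers-per-thread limit of compute capability 2.0 hardware like the GF100, while 8x8 needs 64 and must spill. With one 256-thread block per problem, a 112x112 matrix works out to 49 floats per thread, back under the limit.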
The paper indicates several possible mappings of the matrix to threads. Assigning rows or columns to threads turns out to be significantly less efficient than splitting up the matrix into equal-sized blocks. (Maybe there's hope for assignment 2, the matrix version, after all.)
Unrolling loops and code generation
One interesting point of this paper is how the code handles the dimensions of matrices (for LU and QR). The matrix size must be known and constant, say 192x96; their code will unroll the loops completely, creating a straight-line program that can only solve the problem for 192x96 matrices. This is necessary when using GPU registers, because an opcode must refer to a specific register, not to a register indexed by a variable. Code cannot iterate through different registers in a loop; the loops must be fully unrolled so that each instruction hardcodes its register references.
In this paper's implementation the size is determined at compile time, with templates. This makes sense for one specific size that the code must handle, but it would be difficult to adapt the code to several different (small) sizes, unless the code is generated at run time instead. In my opinion this would be a perfect use of code generation: a linear algebra library could notice that you're solving lots of 192x96 problems, generate on the fly the GPU machine code (or shader or CUDA or OpenCL!) for that problem size, and then start running it. This practice is not unheard of for shaders used in computer graphics, and I think this paper's method would benefit a lot from such code generation. It might then be flexible enough to put into a generic linear algebra library.
The performance curves for these two algorithms are very ragged. This is because they are highly dependent on the architecture of the GPUs, especially the latency between memory operations. See below:
Image taken from page 11 of "A Predictive Model for Solving
Small Linear Algebra Problems in GPU Registers", Anderson et al.
However, the methods are clearly an improvement over existing hybrid CPU+GPU techniques at certain matrix sizes. One problem per thread is particularly good for 7x7 or 8x8 matrices, and one problem per block peaks at size 56x56. At that size the paper reports speedups ranging from 2.8x (against CPU approaches) up to 140x (against existing hybrid CPU+GPU techniques).
Factors that speed up the algorithms
Not really the point of this writeup, but I thought I should mention it. The authors speed up LU decomposition by not doing pivoting, which can cause numerical issues. However, pivoting would require swapping rows and therefore transferring data between threads; this would slow the system down significantly and add variation in how long each problem takes to solve. So although the result is less numerically stable, it makes sense to forgo pivoting in this case. (In any case, the more heavyweight QR decomposition can be used to solve a system of equations if LU decomposition without pivoting fails.)
The authors also manage to speed up their computations by taking advantage of the hardware. First, they use single-precision floating point, which allows more data elements to fit into the register file. In addition, they use a built-in square-root function implemented in hardware, which is accurate to 22 bits of mantissa (out of the 23+1 mantissa bits of single-precision floating point). Without this simple optimization, their code runs 30% slower in some cases. This is important because the libraries these results are compared against (Intel's MKL and the MAGMA library) use fully accurate square-root functions! So the 140x figure is not an entirely fair comparison, although it is probably only about 30% too optimistic.
For large numbers of small matrices, the method presented in the paper is a good idea. Instead of using the usual hybrid CPU+GPU techniques, just copy all the data into registers, and assign multiple threads to a problem if the number of registers required exceeds what a single thread has available. This works well because the bottleneck for small matrices is transferring data from DRAM to registers, not the actual computation.
The method is embarrassingly parallel in the one-problem-per-thread case, and even one-problem-per-block relies on batching solutions together to keep the hardware busy. Still, this is all the parallelism required if you really do have huge batches of small matrices to solve, which apparently comes up in cases like radar processing.