CPSC 521 language assignment
OpenCL is a language designed to allow GPUs (graphics processing units) to be used outside of a graphics context. Modern GPUs are fully programmable and can be used for scientific computing or other applications that can make use of highly parallel cores. GPUs are very fast: a modern six-core CPU can reach around 100 gigaflops, while a high-end GPU can reach about 3.5 teraflops [NVIDIA Tesla K20] ... and it's quite possible for a machine to have multiple graphics cards.
For this writeup I heavily referenced the book Heterogeneous Computing with OpenCL, by Gaster et al. If I make claims without supplying references, they are (hopefully) mentioned in this book.
Similarity to shaders
OpenCL is really just a library which you can link with C or C++ programs -- and a specification for a C-like language for code ("kernels") to run on other devices. The source code for an OpenCL kernel is passed at runtime to a compiler which is part of the graphics driver, and compiled to a target language for the specific hardware present. This allows the code to work on hardware from (e.g.) both AMD and NVIDIA, which use very different instruction sets. And it means that OpenCL works without pragmas or modifying the original source language (C/C++).
This design mirrors how shaders work in OpenGL and DirectX (shaders are specified in source form and passed at runtime to a compiler). There are different shader languages for specific platforms -- e.g. Cg is designed by NVIDIA, and compiles to AMD cards less efficiently. Similarly, several vendor-specific languages have arisen for GPU computing, most notably CUDA by NVIDIA. OpenCL tries to be a generic GPU computing language, and in part it succeeds; in OpenCL the source can be passed directly, or it can be precompiled into an intermediate binary format for AMD or NVIDIA architectures. However, like shader languages, OpenCL is not simple, and its design is very much a product of the rapidly evolving GPU architectures. A language that tries to target all of this must have a great deal of flexibility.
A generic language
When an OpenCL program first starts, it must perform many steps before actually running kernel code:
- query and select platforms, which are implementations of OpenCL, in case there are (say) AMD and Intel devices on the same system;
- query and select devices, namely places where code can be run, which can be CPU or GPU cores or combinations thereof;
- compile ("build") the code for OpenCL kernels for particular devices;
- transfer data between the host system and (say) graphics memory; and
- finally, invoke OpenCL kernels asynchronously, and possibly go do more work.
This flexibility comes at a cost. Many OpenCL functions take five or more parameters, and the enqueue functions such as clEnqueueNDRangeKernel take nine. The C++ wrapper supplied by the major vendors is more manageable. But the concepts, and what the functions actually do, are complex, so using them is necessarily not an easy task.
The way OpenCL does parallelism is to invoke the same kernel code for each element of a 1-, 2-, or 3-dimensional space. The kernel can query its indices along these dimensions (and the size of each dimension), much as a program looks at the return value of fork() or an MPI program looks at its rank to decide what to do. Often for OpenCL the indices will indicate where in an array the kernel should do its processing.
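To make the index-space idea concrete, here is a minimal sketch in plain C rather than real OpenCL. The names caesar_kernel and run_1d are my own, and the host loop only stands in for a 1-dimensional launch; in an actual kernel the index would come from get_global_id(0) instead of a function argument. The Caesar-cipher body is chosen to echo the hello-world program attached to this writeup, not copied from it.

```c
#include <stddef.h>

/* Body of a hypothetical "kernel": one call handles exactly one element.
 * In real OpenCL C, gid would come from get_global_id(0); here the host
 * loop below passes it in, so this is only a model of the idea. */
void caesar_kernel(size_t gid, const char *in, char *out, int shift)
{
    char c = in[gid];
    int back = 26 - shift % 26;            /* shift backwards to decode */
    if (c >= 'a' && c <= 'z')
        c = (char)('a' + (c - 'a' + back) % 26);
    else if (c >= 'A' && c <= 'Z')
        c = (char)('A' + (c - 'A' + back) % 26);
    out[gid] = c;                          /* the index picks the array slot */
}

/* Stand-in for launching a 1-dimensional index space of global_size items.
 * A real device would run these invocations in parallel, in no fixed order;
 * that is safe here because each one touches only its own element. */
void run_1d(size_t global_size, const char *in, char *out, int shift)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        caesar_kernel(gid, in, out, shift);
}
```

Calling run_1d(13, "Khoor, Zruog!", out, 3) fills out with the decoded characters. Note that the kernel body contains no loop over the data at all: the loop is supplied by the launch.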
Because there may be a large number of GPU cores available, it may be advantageous to make the parallelism as fine-grained as possible. (Also, if the number of work items is a multiple of the number of cores, this may be more efficient since GPU cores are heavily SIMD-based.) Each work-item will be mapped to a core, and each core can handle as many work items as necessary, like loop chunking in OpenMP.
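The arithmetic behind this can be sketched in a few lines of plain C. Both helpers are hypothetical (round_up and chunk_for_core are names I made up), and the contiguous-chunk mapping is just one assumed policy, in the spirit of a static OpenMP schedule; how a real driver assigns work-items to cores is up to the implementation.

```c
#include <stddef.h>

/* Round n up to the next multiple of group_size (group_size > 0), e.g. to
 * pad the number of work-items so it divides evenly across SIMD lanes. */
size_t round_up(size_t n, size_t group_size)
{
    return (n + group_size - 1) / group_size * group_size;
}

/* Half-open range [*begin, *end) of work-items handled by `core` when
 * n_items are split into contiguous chunks over n_cores (n_cores > 0).
 * Cores past the end of the data get an empty range. */
void chunk_for_core(size_t n_items, size_t n_cores, size_t core,
                    size_t *begin, size_t *end)
{
    size_t chunk = (n_items + n_cores - 1) / n_cores;  /* ceiling division */
    size_t b = core * chunk;
    size_t e = b + chunk;
    *begin = b < n_items ? b : n_items;
    *end   = e < n_items ? e : n_items;
}
```

With 10 work-items on 4 cores the chunk size is 3, so the last core gets only one item; padding the item count with round_up avoids that kind of ragged final chunk at the cost of a few idle items.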
Memory structure and memory consistency
Kernels running on actual GPUs will often be in lockstep, since GPUs make use of SIMD lanes. However, different processing units can presumably get out of sync by a few clock cycles in either direction. OpenCL does not guarantee the consistency of any (non-constant) memory accessed by multiple tasks until the kernels have all finished executing. But OpenCL does provide a simple synchronization barrier primitive, which forces all work-items in the same workgroup to reach that point before any of them continues (essentially enforcing consistency within the workgroup at that particular moment), in case kernels depend on values computed by their cohorts.
The barrier will be a no-op in cases where all tasks are scheduled for the same SIMD unit.
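A serial model in plain C shows why the barrier matters. In this sketch (all names are mine, and GROUP_SIZE is an arbitrary choice) each work-item first publishes a value and then reads its neighbour's value. Splitting the work into two separate loops plays the role that a call like barrier(CLK_LOCAL_MEM_FENCE) would play inside a real kernel: no item starts the second phase until every item has finished the first.

```c
#include <stddef.h>

#define GROUP_SIZE 8   /* hypothetical workgroup size */

/* Phase 1: each work-item publishes the square of its input. */
void phase_write(size_t lid, const int *in, int *shared)
{
    shared[lid] = in[lid] * in[lid];
}

/* Phase 2: each work-item combines its own value with the one written
 * by its right-hand neighbour (wrapping around at the end). */
void phase_read(size_t lid, const int *shared, int *out)
{
    out[lid] = shared[lid] + shared[(lid + 1) % GROUP_SIZE];
}

/* Serial model of one workgroup of GROUP_SIZE items. */
void run_group(const int *in, int *out)
{
    int shared[GROUP_SIZE];
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        phase_write(lid, in, shared);
    /* <-- in a real kernel, the barrier would sit here */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        phase_read(lid, shared, out);
}
```

If the two phases were fused into a single loop, item 0 would read shared[1] before item 1 had written it, which is exactly the hazard the barrier rules out on parallel hardware.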
In OpenCL there are many different types of memory, and the programmer has to be concerned with where data is sitting and with transferring data between memories. Besides the "host" memory in the original program, there is global memory, which typically sits in GPU video memory where all kernels can access it; local memory, which can be accessed by all kernels in the same workgroup (part of the same N-dimensional range); and private memory, which can only be accessed by a single work-item (this may be in the register file). OpenCL provides commands for migrating data as necessary, but it is quite programmer-intensive. It's an interesting blend of message passing and shared memory, in that you issue commands/messages in order to share memory.
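The movement between these memories can be modelled in plain C, with ordinary arrays standing in for each tier. Everything here is an assumption for illustration: group_sum, host_sum, LOCAL_SIZE and MAX_ITEMS are made-up names, and the memcpy only imitates what a buffer-transfer command would do on a real system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 4    /* hypothetical workgroup size */
#define MAX_ITEMS  64   /* capacity of the pretend device buffer */

/* One workgroup: stage its slice of "global" memory in "local" scratch,
 * then reduce it into a "private" accumulator. */
int group_sum(const int *global_in, size_t group_id)
{
    int local_buf[LOCAL_SIZE];                  /* models local memory */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        local_buf[lid] = global_in[group_id * LOCAL_SIZE + lid];

    int acc = 0;                                /* models private memory */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        acc += local_buf[lid];
    return acc;
}

/* Host side: copy the input to the "device", run every workgroup, then
 * read the partial sums back. n must be a multiple of LOCAL_SIZE. */
int host_sum(const int *host_in, size_t n)
{
    assert(n <= MAX_ITEMS && n % LOCAL_SIZE == 0);
    int global_in[MAX_ITEMS];                   /* models global memory */
    int global_out[MAX_ITEMS / LOCAL_SIZE];
    memcpy(global_in, host_in, n * sizeof *host_in);   /* host -> global */

    size_t n_groups = n / LOCAL_SIZE;
    for (size_t g = 0; g < n_groups; ++g)
        global_out[g] = group_sum(global_in, g);

    int total = 0;                              /* global -> host */
    for (size_t g = 0; g < n_groups; ++g)
        total += global_out[g];
    return total;
}
```

Even in this toy version, three of the four steps are about moving data rather than computing with it, which matches my experience of how much of an OpenCL program is bookkeeping.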
Remember that OpenCL is largely asynchronous: the host process can start many kernels executing in parallel, without necessarily knowing which ones will finish first. The other main component of OpenCL is the dependency graph. This is constructed explicitly: each enqueue call can be given a list of events to wait on and can return an event of its own, and together these indicate which kernels must finish before any given kernel can begin executing.
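As a rough model of what the runtime has to do with such a graph, the sketch below orders a handful of tasks so that each one runs only after everything it waits on. It is plain C with hypothetical names (struct task, schedule, MAX_DEPS, MAX_TASKS); in the real API the dependencies would be expressed through the event wait list arguments of calls like clEnqueueNDRangeKernel rather than through array indices.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS  4
#define MAX_TASKS 16

/* Hypothetical record for one enqueued kernel: the indices of the
 * tasks that must finish before it may start. */
struct task {
    size_t n_deps;
    size_t deps[MAX_DEPS];
};

/* Fill order[] with one valid execution order and return how many tasks
 * were scheduled. A result smaller than n means the graph has a cycle.
 * Assumes every dependency index is less than n. */
size_t schedule(const struct task *tasks, size_t n, size_t *order)
{
    assert(n <= MAX_TASKS);
    int done[MAX_TASKS] = {0};
    size_t count = 0;
    int progress = 1;

    while (count < n && progress) {
        progress = 0;
        for (size_t i = 0; i < n; ++i) {
            if (done[i])
                continue;
            int ready = 1;
            for (size_t d = 0; d < tasks[i].n_deps; ++d)
                if (!done[tasks[i].deps[d]])
                    ready = 0;          /* still waiting on a predecessor */
            if (ready) {
                done[i] = 1;            /* "run" the task */
                order[count++] = i;
                progress = 1;
            }
        }
    }
    return count;
}
```

For a diamond-shaped graph, where task 0 waits on tasks 1 and 2 and both of those wait on task 3, this produces the order 3, 1, 2, 0. On a real device tasks 1 and 2 would be free to run at the same time, since neither depends on the other.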
OpenCL has many other features relating to the dependency graph: the ability to register callback functions when tasks complete, and the ability to make a node in the dependency graph which is really native C code, executed on the host machine.
A final point about the dependency graph: it is interesting that a very similar structure is used within compilers to do flow optimizations, and at the hardware level to do instruction scheduling. This technology makes sense for hardware and compilers, but now it has been pushed as high as the application level, where OpenCL code must specify the graph directly. It makes me think there's scope for a compiler for a simpler language, which would generate lower-level (OpenCL) code, which could be compiled by vendor-specific compilers to target specific hardware. Or a smarter, unified compiler. The current setup is tedious, but very powerful, if you can discover how to take advantage of it!
OpenCL allows code to be pushed onto CPU and GPU cores and other devices. It's fairly low-level, with details of memory movement and dependencies needing to be specified by the programmer -- but powerful as well, in that you should be able to use many different kinds of devices and GPUs from different vendors, even GPUs that have yet to appear. Parallelism is mostly data parallelism and each node should be executing similar code, because a common implementation would be to map the work items to SIMD processors running in lockstep. In general the language may be tricky to use, but if you have the right kind of problem you can harness the power of graphics processors on your system which are probably not even being used, and the code should be portable to future platforms.
- opencl-hello.tar.gz (10 KB): A hello-world program written in OpenCL (decodes a Caesar-cipher encoded string). Based on HelloWorld_Kernel.cl from the AMD SDK and also the first program in the book by Gaster et al (pages 32-38).