Architecture of OpenCLIPER
OpenCLIPER is designed as a set of classes which provide three main services to the developer:
- Computing device management.
- Data storage and manipulation.
- Algorithm handling.
These services are described below.
Computing device management
At first glance, the most visible drawback of OpenCL is that the concept of the device is not handled automatically, as happens with CUDA. Quite the opposite: developers have to deal with device discovery and initialization explicitly, as well as with the particular capabilities of the device classes available (e.g. a particular job may execute optimally on a GPU or an FPGA but not on a CPU, or vice versa). This is the price to pay for the versatility of OpenCL.
OpenCL also introduces the concept of platforms to address the problem of supporting different hardware vendors. As a result, some devices are visible through one platform but not through another, some are visible once per platform (if more than one vendor supports them), and some even support one OpenCL version through one platform but a different version through another.
In plain OpenCL, it is the developer's job to retrieve and traverse the list of available platforms and devices, and then choose the most appropriate one. With OpenCLIPER, one selects the desired device according to a combination of criteria (e.g. device class, vendor, supported OpenCL version, etc.) in a single call.
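To make the idea of criteria-based selection concrete, here is a minimal sketch in plain C++. It is not OpenCLIPER's actual API: `DeviceInfo`, `DeviceCriteria`, `matches` and `selectDevice` are hypothetical names, and the device list is assumed to have been filled beforehand from the OpenCL platform and device queries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of one device as seen through one platform.
// In a real program these fields would come from the OpenCL device queries.
enum class DeviceClass { Any, Cpu, Gpu, Accelerator };

struct DeviceInfo {
    std::string platform;   // platform (vendor) through which the device is seen
    std::string name;
    DeviceClass deviceClass;
    int versionMajor;       // supported OpenCL version, e.g. 1.2 -> major 1, minor 2
    int versionMinor;
};

// Hypothetical selection criteria; defaults mean "don't care", except for
// the minimum OpenCL version, which defaults to 1.2.
struct DeviceCriteria {
    DeviceClass deviceClass = DeviceClass::Any;
    std::string platformSubstring;  // empty string matches every platform
    int minMajor = 1;
    int minMinor = 2;
};

// True if the device satisfies every criterion.
bool matches(const DeviceInfo& d, const DeviceCriteria& c) {
    assert(c.minMajor >= 1 && c.minMinor >= 0);
    if (c.deviceClass != DeviceClass::Any && d.deviceClass != c.deviceClass) return false;
    if (d.platform.find(c.platformSubstring) == std::string::npos) return false;
    if (d.versionMajor != c.minMajor) return d.versionMajor > c.minMajor;
    return d.versionMinor >= c.minMinor;
}

// Returns the first matching device, or nullptr if none matches.
const DeviceInfo* selectDevice(const std::vector<DeviceInfo>& devices,
                               const DeviceCriteria& criteria) {
    for (const DeviceInfo& d : devices)
        if (matches(d, criteria)) return &d;
    return nullptr;
}
```

With a structure like this, asking for "any GPU supporting at least OpenCL 1.2" amounts to setting `deviceClass` to `DeviceClass::Gpu` and calling `selectDevice` once. Because the same physical device may appear once per platform, filtering on the platform name is what disambiguates duplicates.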
OpenCLIPER targets OpenCL version 1.2, which is supported by most major vendors as of today. Although we may use features from newer versions in the future, we intend to continue supporting 1.2, at least until most vendors support a higher version.
Data storage and manipulation
Another problem with GPU programming (or, more generally, with dedicated device programming) comes from the fact that host and device memories have their own, separate, memory maps. Therefore, data need to be transferred from host to device before being processed, and from device to host afterwards. While this is a fairly simple process in CUDA, it is not that simple in OpenCL, again because of its versatility: the concepts of context and command queue come into play here, and things get more complicated when one tries to optimize transfers by using mapped host memory (pinned memory in CUDA terms). The OpenCLIPER way is simpler: a single call suffices to transfer a data set to or from the computing device, and the data set is automatically pinned for you.
Speaking of data, other image processing and reconstruction approaches typically provide a data structure that contains a single N-dimensional array, and the developer has to keep track of all of them. But what if the data to be processed are heterogeneous? Consider, for example, a 3D+t volume and a synchronization signal, several 3D volumes coming from several sensors, or even several ND volumes coming from several sensors, each one with various synchronization signals of its own. Typically, one would have to create ad-hoc data structures and handle data transfers by hand. OpenCLIPER, however, is agnostic about the internal data organization, so one can create structures as complex as needed and have them transferred to or from the device in a single call. Even for arbitrary-dimension data? Yes. Even for data in the complex plane? Yes. Even for mixed integer/string/float/complex, arbitrary-dimension data? Oh, yes!
But then you may say: OK, but I will lose control of the memory layout of my data in the computing device, right? Wrong! With OpenCLIPER, a single data set is always linear and contiguous in the computing device, even when it consists of highly heterogeneous data. This way, data can be processed in batches, because the starting position of each component in a given data set is known in advance and readily available from OpenCL kernels. No need to keep track of the data yourself. No need to keep track of data sizes yourself.
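The following sketch illustrates, on the host side only, what "linear and contiguous with known starting positions" can look like. It is an assumption-laden illustration, not OpenCLIPER's real data classes: `PackedDataSet`, `appendComponent` and `readElement` are hypothetical names, and the alignment policy (each component aligned to its element type) is just one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical container: one contiguous byte block plus an offset table.
struct PackedDataSet {
    std::vector<unsigned char> bytes;  // the single block that would be sent to the device
    std::vector<std::size_t> offsets;  // byte offset of each component inside `bytes`
    std::vector<std::size_t> sizes;    // byte size of each component
};

// Appends one component (any trivially copyable element type), padding with
// zeros so that it starts at a multiple of `alignment` bytes.
// Returns the index of the new component.
template <typename T>
std::size_t appendComponent(PackedDataSet& ds, const std::vector<T>& values,
                            std::size_t alignment = alignof(T)) {
    static_assert(std::is_trivially_copyable<T>::value, "raw byte copy requires a trivial type");
    assert(alignment > 0);
    const std::size_t start = (ds.bytes.size() + alignment - 1) / alignment * alignment;
    const std::size_t nbytes = values.size() * sizeof(T);
    ds.bytes.resize(start + nbytes);  // new bytes (including padding) are zero-filled
    if (nbytes > 0) std::memcpy(ds.bytes.data() + start, values.data(), nbytes);
    ds.offsets.push_back(start);
    ds.sizes.push_back(nbytes);
    return ds.offsets.size() - 1;
}

// Reads element `index` of a component using only the offset table,
// the same arithmetic a kernel would perform on the device buffer.
template <typename T>
T readElement(const PackedDataSet& ds, std::size_t component, std::size_t index) {
    assert(component < ds.offsets.size());
    assert((index + 1) * sizeof(T) <= ds.sizes[component]);
    T value;
    std::memcpy(&value, ds.bytes.data() + ds.offsets[component] + index * sizeof(T), sizeof(T));
    return value;
}
```

Since every component lives in the same block, a single transfer moves the whole data set, and passing the offset table along is enough for device code to locate any component without further bookkeeping.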
"But I still have to load the data from disk and adapt them to OpenCLIPER's format", you may say again, and you'd be wrong once more! OpenCLIPER supports many image formats (through the DevIL library), as well as volumes in MATLAB's .mat format. We have plans to support the ISMRMRD format too. And even if your format is not supported out of the box, you just have to derive from the appropriate class, as usual. In addition, OpenCLIPER supports raw data files.
Algorithm handling
In OpenCL, it is also usually considered burdensome to load the kernels, compile them, check for possible errors and report them, keep track of the kernels at run time, etc. One has to keep in mind the concept of programs too, which are different from kernels. Once again, with OpenCLIPER you can do all this in a single call, even if the kernels are scattered across multiple files, and have all of them readily available by name.
OpenCLIPER has also been designed to ease the job of launching kernels, even if they are very different in nature. To this end, the concept of a process is introduced. Processes have a customizable but standard calling interface, so that any work is treated the same way: briefly, set the input and output data sets, set the parameters, and launch. Since it is quite common for kernels to need some kind of initialization before doing the real job, and this initialization may be costly (e.g. computing a lookup table, creating a plan for an FFT or ordering the input data), OpenCLIPER separates initialization from the actual kernel launch, so that performance is not compromised. Processes can be chained at no cost, of course (setting the outputs of one stage as the inputs of the next one is zero-copy).
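The pattern just described (uniform interface, costly initialization separated from a cheap launch, zero-copy chaining) can be sketched in plain C++ as follows. This is only an illustration running on the host: `Process`, `SquareLut`, `AddConstant` and the `Data` alias are hypothetical stand-ins, not OpenCLIPER's real classes, and shared pointers play the role of device data sets.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Data = std::vector<int>;  // stand-in for a data set living on the device

// Uniform calling interface: set input/output, init() once, launch() many times.
class Process {
public:
    virtual ~Process() = default;
    void setInput(std::shared_ptr<Data> in) { input_ = std::move(in); }
    void setOutput(std::shared_ptr<Data> out) { output_ = std::move(out); }
    std::shared_ptr<Data> output() const { return output_; }
    virtual void init() {}      // costly one-time setup (tables, plans, ...)
    virtual void launch() = 0;  // the actual work, cheap to repeat
protected:
    std::shared_ptr<Data> input_;
    std::shared_ptr<Data> output_;
};

// Example process whose init() precomputes a lookup table of squares.
class SquareLut : public Process {
public:
    void init() override {
        lut_.resize(256);
        for (std::size_t i = 0; i < lut_.size(); ++i) lut_[i] = static_cast<int>(i * i);
    }
    void launch() override {
        assert(!lut_.empty() && input_ && output_);  // init() and setters must come first
        output_->resize(input_->size());
        for (std::size_t i = 0; i < input_->size(); ++i) {
            const int v = (*input_)[i];
            assert(v >= 0 && v < 256);
            (*output_)[i] = lut_[static_cast<std::size_t>(v)];
        }
    }
private:
    std::vector<int> lut_;
};

// Example process with one parameter and no initialization cost.
class AddConstant : public Process {
public:
    explicit AddConstant(int c) : c_(c) {}
    void launch() override {
        assert(input_ && output_);
        output_->resize(input_->size());
        for (std::size_t i = 0; i < input_->size(); ++i) (*output_)[i] = (*input_)[i] + c_;
    }
private:
    int c_;
};
```

Chaining then reduces to `second.setInput(first.output())`: both stages hold a pointer to the same buffer, so nothing is copied between them, and `init()` is paid once no matter how many times the chain is launched.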
On top of this, since existing approaches for image processing and reconstruction are usually based on CUDA, one needs to decide at compile time whether the computing device will be the CPU or the (NVIDIA) GPU. Typically, this means duplicating code and data structures: one version for the CPU and another for the GPU. With OpenCL (and, consequently, with OpenCLIPER) the computing device is chosen at run time, and the code and data structures are the same regardless of the chosen device. No need to maintain different code to support different device classes.