Tutorial

Please have a look at the architecture page if you haven't done so yet, to see what OpenCLIPER can do for you, and then come back again to start coding.

Overview

The goal of OpenCLIPER is to simplify writing OpenCL code. It will not make your code faster, nor more parallel, nor more bug-free. However, it will allow you to focus on coding kernels, because all the plethora of housekeeping work which is typically needed by OpenCL applications is taken care of automatically.

Note: from now on, we assume you have at least a very basic knowledge of how OpenCL works.

Tutorial: Intensity inverting filter

In this tutorial, we will write a first, simple program which just reverses the pixels of an image and, for the sake of demonstration, we will be comparing OpenCLIPER code with its pure-OpenCL equivalent.


Intensity inverting in C

First, let's see a pure C implementation. Because all the code runs in the CPU, there is no need to-do anything special and the source ends up being pretty short:

  1. Load an input image. There are plenty of libraries that can do this. We will use OpenCV, for instance (as OpenCLIPER does internally):

    cv::Mat image;
    // Have OpenCV convert color images to grayscale
    // Also, convert 16/32-bpp images to 8-bpp (OpenCV does this by default).
    image=cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    unsigned int width = image.cols;
    unsigned int height = image.rows;
    

    That gives us the image in the image object. For brevity, error checking is omitted.

  2. Do the real work, i.e. reverse the pixels in the image. Let's assume the image is one byte per pixel:

    for(unsigned int i = 0; i < width * height; i++)
        image.data[i]=255-image.data[i];
    

    Of course, this is not optimized in any way but we are not measuring performance here.

  3. Save the output image. Again, let's use OpenCV to do the dirty work:

    cv::imwrite("output.png", image);
    


Intensity inverting in OpenCL vs. Intensity inverting in OpenCLIPER

Well, that was simple. Next, we'll do the same thing in OpenCL. As you know, the processing work is written in a separate kernel file. Kernels are natively parallel, so we just need to specify how to transform a single pixel and then let OpenCL spread the work throughout all possible ALUs, like this:

__kernel void negate_kernel(__global uchar* input, __global uchar* output) {
    uint i = get_global_id(0);
    output[i] = (255 - input[i]);
}

But the tricky part is how to tell the processing device to execute that code. This is where OpenCLIPER comes in to help. Don't trust us; see it for yourself.


1a) Initializing the device in OpenCL

  1. With OpenCL, we need at least one platform available, so let's look for any of them and choose the first one we find:

    cl_uint numPlatforms;
    cl_platform_id platform = NULL;
    cl_int status = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (status != CL_SUCCESS) {
        fprintf(stderr,"Error: Getting platforms!\n");
        return 1;
    }
    
    if(numPlatforms > 0) {
        cl_platform_id* platforms = (cl_platform_id* )malloc(numPlatforms* sizeof(cl_platform_id));
        status = clGetPlatformIDs(numPlatforms, platforms, NULL);
        platform = platforms[0];
        free(platforms);
    }
    

    It is not very elegant to pick the first available platform just because it was the first. Maybe we are missing an interesting device which is only present in, say, the fifth available platform.

  2. Now we need the list of devices in the chosen platform. We'll try to get a GPU-class device and, if there aren't any (remember: in the platform we chose above), then fall back to the CPU (again, if this platform supports it):

    cl_uint numDevices = 0;
    cl_device_id *devices;
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
    
    if (numDevices == 0) { //no GPU available.
        fprintf(stderr,"No GPU device available.\n");
        fprintf(stderr,"Choose CPU as default device.\n");
    
        status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 0, NULL, &numDevices);    
        devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
        status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, numDevices, devices, NULL);
    }
    else {
        devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
        status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);
    }
    
  3. Last for device initialization, we need an OpenCL context and a command queue associated to the context and the device we found:

    cl_context context = clCreateContext(NULL,1, devices,NULL,NULL,NULL);
    cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);
    

With this, device initialization is over, but keep in mind that we are possibly missing some interesting devices, and that we did not care about any device properties. If we need a device that supports, say, OpenCL 1.1 or want to try the platform from vendor X, we have to throw more code at it.


1b) Initializing the device in OpenCLIPER

That's it for device selection and initialization. On any error, an exception is raised accordingly, so no need to check explicitly after each call.


2a) Loading the kernel in OpenCL

Next, we need to load the kernel that negates the pixels. Assume we saved it in a file called negate.cl. Then:

  1. We need to load the kernel into memory as a string, and then create a CL program from it:

    char *source;
    FILE *f;
    struct stat fileInfo;
    size_t fileSize;
    
    f=fopen("negate.cl","r");
    fstat(f->_fileno,&fileInfo);
    fileSize=fileInfo.st_size;
    source=(char*)malloc(fileSize+1);
    fread(source,sizeof(char),fileSize,f);
    fclose(f);
    source[fileSize]='\0';
    
    size_t sourceSize[] = {strlen(source)};
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&source, sourceSize, NULL);
    
  2. Now we need to compile the program, dump the log to the screen (just in case anything went wrong) and then create the kernel object.

    status=clBuildProgram(program, 1,devices,NULL,NULL,NULL);
    printf("status_build=%d\n",status);
    char *info=(char*)malloc(1000000);
    clGetProgramBuildInfo(program,devices[0],CL_PROGRAM_BUILD_LOG,1000000,info,NULL);
    printf("log=%s\n",info);
    
    cl_kernel kernel = clCreateKernel(program,"negate_kernel", NULL);
    

With this, we have created one kernel object, which we must keep track of by hand. Now, for the OpenCLIPER variant.


2b) Loading the kernel in OpenCLIPER

And that's all. The CL file has been loaded, the CL program created and compiled, and a kernel object has been created automagically for every kernel present in the source file(s), which we can access by name anywhere in our code. If there were any errors, an exception is raised with the build log.


3a) Setting inputs and outputs in OpenCL

  1. First we have to load the input image and copy it to device memory. We can use OpenCV again to load the image:

    cv::Mat inputImage;
    inputImage = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    unsigned int width = inputImage.cols;
    unsigned int height = inputImage.rows;
    
    cl_mem inputDevBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, width * height * sizeof(char), (void *) inputImage.data, NULL);
    
  2. Now reserve another chunk of memory for the output, in the host and in the device:

    outputImage = cv::Mat(width, height, CV_8U);
    cl_mem outputDevBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY , width * height * sizeof(char), NULL, NULL);
    

With this, we have our input image in the device, awaiting to be processed, and an empty output buffer to store the result.


3b) Setting inputs and outputs in OpenCLIPER

  1. As in plain OpenCL, we load the input image, but instead of explicitly copying it to the device, we just have to associate it with our CL application (note that OpenCLIPER uses floats by default as the data type):

    std::shared_ptr<XData> pIn(new XData(pCLapp, {std::string("input.png")}, TYPEID_REAL));
    
  2. Now, the same goes for the output: first create it (empty) and associate it with our app:

    auto pOut = std::make_shared<XData>(pCLapp, pIn, false);
    

Once we have the shared pointers, we can be sure that the data have been pinned in memory to improve transfer speeds and are properly laid in the device memory, their offsets and sizes conveniently accessible from CL kernels (in this simple tutorial this is not noticeable but when you have to work with n-dimensional, multi-slice, heterogeneous data sets, it comes in very handy).


4a) Setting up and launching the process in OpenCL

  1. Before launching the kernel, we have to set its parameters (the input and output buffers):

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer1);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputBuffer);
    
  2. Now we can finally launch (enqueue in OpenCL terms) the kernel:

    size_t global_work_size[1] = {width * height};
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
    

This was simple indeed, but note how the final user has to deal with the internals of your kernel (i.e. know how you programmed it) to be able to use it. The solution is, of course, to encapsulate the kernel in its own class (or procedures if you are stuck to pure C) and provide a convenient interface to the user. That is exactly what OpenCLIPER does already for you.


4b) Setting up and launching the process in OpenCLIPER

To overcome the the need to code a proper interface for your kernels, OpenCLIPER includes the concept of processes. Processes expose a simple way to initialize and launch kernels, as well as to change their parameters safely. Processes can contain as many kernels as you like, and they can be chained at your discretion at no cost (i.e. setting a process output as another process' input is zero-copy).

Note also that OpenCLIPER is OpenCL, so you can call your kernel directly if you so desire, but then you'll have to explain your users all the gory details about your kernel internals!

  1. So, first we need to create a process that encapsulates our kernel and exposes a nice interface to the user. To this end, we write its header file, Negate.hpp:

    #include <OpenCLIPER/CLapp.hpp>
    #include <OpenCLIPER/Process.hpp>
    
    namespace OpenCLIPER {
        class Negate : public OpenCLIPER::Process {
            public:
                Negate(std::shared_ptr<OpenCLIPER::CLapp> pCLapp): Process(pCLapp) {};
                Negate(std::shared_ptr<OpenCLIPER::CLapp> pCLapp, std::shared_ptr<Data> pInputData, std::shared_ptr<Data> pOutData): Process(pCLapp, pInputData, pOutData) {}
    
                void init();
                void launch(ProfileParameters profParms);
        };
    }
    

    And then its implementation file, Negate.cpp:

    #include <Negate.hpp>
    
    namespace OpenCLIPER {
        void Negate::init() {
            kernel = getApp()->getKernel("negate_kernel");
            queue = getApp()->getCommandQueue();
        }
    
        void Negate::launch(ProfileParameters profileParameters) {
            cl::Buffer* pInBuf = getInput()->getDeviceBuffer();
            cl::Buffer* pOutBuf = getOutput()->getDeviceBuffer();
    
            kernel.setArg(0, *pInBuf);
            kernel.setArg(1, *pOutBuf);
    
            cl::NDRange globalSizes = {NDARRAYWIDTH(getInput()->getNDArray(0)) * NDARRAYHEIGHT(getInput()->getNDArray(0))};
    
            queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSizes, cl::NDRange(), NULL, NULL);
        }
    }
    

    Note how kernel initialization and launching are separated so that lengthy initializations don't hinder consecutive executions. Additional parameters can be set by the user on-the-fly, but we don't use that functionality in this tutorial.

  2. Now that we have our process ready, we can link to it any buffer available in our app and launch it. Every process has an input and an output so that they can be chained easily:

    std::unique_ptr<Process> pProcess(new Negate(pCLapp, pIn, pOut));
    
    pProcess->init();
    pProcess->launch();
    

    ... and your users just see what they have (and want) to see. No more, no less.


5a) Saving results in OpenCL

Just the opposite of loading input data. First we copy the result image from the device to the host, and then save it:

    clEnqueueReadBuffer(commandQueue, outputDevBuffer, CL_TRUE, 0, width * height * sizeof(char), outputImage.data, 0, NULL, NULL);

    cv::imwrite("output.png", outputImage);


5b) Saving results in OpenCLIPER

We only have to call the save method (the result image is automatically copied to the host, albeit this time pinned memory is used, and then saved):

    pOut->save("output.png");

Summary

Here you have the listings of both versions side by side.

OpenCL

#include <CL/cl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <fstream>
#include <sys/stat.h>
#include <sys/time.h>
#include <opencv2/opencv.hpp>

int convertToString(const char *filename,char **s) {
  FILE *f;
  struct stat fileInfo;
  size_t fileSize;

  f=fopen(filename,"r");
  if(f!=NULL) {
      fstat(f->_fileno,&fileInfo);
      fileSize=fileInfo.st_size;

      *s=(char*)malloc(fileSize+1);

      fread(*s,sizeof(char),fileSize,f);
      fclose(f);
      *s[fileSize]='\0';
      return 0;
  }
  fprintf(stderr,"Error: failed to open file %s\n",
      filename);
  return -1;
}

int main(int argc, char* argv[]) {
  cl_uint numPlatforms;
  cl_platform_id platform = NULL;
  cl_int status = clGetPlatformIDs(0, NULL,
      &numPlatforms);
  if (status != CL_SUCCESS) {
      fprintf(stderr,"Error: Getting platforms!\n");
      return 1;
  }

  if(numPlatforms > 0) {
    cl_platform_id* platforms =
      (cl_platform_id* )malloc(numPlatforms*
        sizeof(cl_platform_id));
      status = clGetPlatformIDs(numPlatforms,
        platforms, NULL);
      platform = platforms[0];
      free(platforms);
  }

  cl_uint numDevices = 0;
  cl_device_id *devices;
  status = clGetDeviceIDs(platform,
    CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);

  if (numDevices == 0) {
    fprintf(stderr,"No GPU device available.\n");
    fprintf(stderr,"Choose CPU as default device.\n");

    status = clGetDeviceIDs(platform,
        CL_DEVICE_TYPE_CPU, 0, NULL, &numDevices);
    devices = (cl_device_id*)malloc(numDevices *
        sizeof(cl_device_id));
    status = clGetDeviceIDs(platform,
        CL_DEVICE_TYPE_CPU, numDevices, devices,
        NULL);
  }
  else {
    devices = (cl_device_id*)malloc(
        numDevices * sizeof(cl_device_id));
    status = clGetDeviceIDs(platform,
        CL_DEVICE_TYPE_GPU, numDevices, devices,
        NULL);
  }

  cl_context context = clCreateContext(NULL,1,
     devices,NULL,NULL,NULL);
  cl_command_queue commandQueue =
    clCreateCommandQueue(context, devices[0], 0,
    NULL);

  const char *filename = "negate.cl";
  char *source;
  status = convertToString(filename, &source);
  size_t sourceSize[] = {strlen(source)};
  cl_program program = clCreateProgramWithSource(
    context, 1, (const char**)&source, sourceSize,
    NULL);

  status=clBuildProgram(program, 1,devices,NULL,
    NULL, NULL);
  printf("status_build=%d\n",status);
  char *info=(char*)malloc(1000000);
  clGetProgramBuildInfo(program,devices[0],
    CL_PROGRAM_BUILD_LOG,1000000,info,NULL);
  printf("log=%s\n",info);

  cl_kernel kernel = clCreateKernel(program,
    "negate_kernel", NULL);

  cv::Mat inputImage;
  inputImage = cv::imread("input.png",
    cv::IMREAD_GRAYSCALE);
  unsigned int width = inputImage.cols;
  unsigned int height = inputImage.rows;

  cl_mem inputDevBuffer = clCreateBuffer(context,
    CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,
    width * height * sizeof(char),
    (void *) inputImage.data, NULL);

  cv::Mat outputImage;
  outputImage = cv::Mat(width, height, CV_8U);
  cl_mem outputDevBuffer = clCreateBuffer(context,
    CL_MEM_WRITE_ONLY ,
    width * height * sizeof(char),
    NULL, NULL);

  status = clSetKernelArg(kernel, 0, sizeof(cl_mem),
    (void *)&inputDevBuffer);
  status = clSetKernelArg(kernel, 1, sizeof(cl_mem),
    (void *)&outputDevBuffer);

  size_t global_work_size[1] = {width * height};
  status = clEnqueueNDRangeKernel(commandQueue,
    kernel, 1, NULL, global_work_size, NULL, 0,
    NULL, NULL);

  clEnqueueReadBuffer(commandQueue,
    outputDevBuffer, CL_TRUE, 0,
    width * height * sizeof(char), outputImage.data,
    0, NULL, NULL);

  cv::imwrite("output.png", outputImage);

  status = clReleaseMemObject(outputDevBuffer);
  status = clReleaseMemObject(inputDevBuffer);
  status = clReleaseKernel(kernel);
  status = clReleaseProgram(program);
  status = clReleaseCommandQueue(commandQueue);
  status = clReleaseContext(context);
  return 0;
}

OpenCLIPER: main program

#include <OpenCLIPER/XData.hpp>
#include <OpenCLIPER/processes/examples/Negate.hpp>
#include <iostream>
#include <string>

using namespace OpenCLIPER;
int main(int argc, char *argv[]) {

  try {
    CLapp::PlatformTraits platformTraits;
    CLapp::DeviceTraits deviceTraits;
    auto pCLapp = CLapp::create(platformTraits, deviceTraits);

    pCLapp->loadKernels("examples/negate.cl");

    std::shared_ptr pIn(new XData(pCLapp,
      {std::string("input.png")}, TYPEID_REAL));

    auto pOut = std::make_shared(pCLapp, pIn, false);

    std::unique_ptr pProcess(new Negate(pCLapp, pIn, pOut));

    pProcess->init();
    pProcess->launch();

    pOut->save("output.png");

    pProcess.reset(nullptr);
    pIn = nullptr;
    pOut = nullptr;
    pCLapp = nullptr;
  } catch (std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
  }
}

OpenCLIPER: Negate.hpp

#include <OpenCLIPER/CLapp.hpp>
#include <OpenCLIPER/Process.hpp>

namespace OpenCLIPER {
  class Negate : public OpenCLIPER::Process {
    public:
        Negate(std::shared_ptr pCLapp): Process(pCLapp) {};
        Negate(std::shared_ptr pCLapp,
             std::shared_ptr pInputData,
             std::shared_ptr pOutData):
        Process(pCLapp, pInputData, pOutData) {}
        void init();
        void launch(ProfileParameters profParms);
    };
}

OpenCLIPER: Negate.cpp

#include <Negate.hpp>

namespace OpenCLIPER {
  void Negate::init() {
    kernel = getApp()->getKernel("negate_kernel");
    queue = getApp()->getCommandQueue();
  }

  void Negate::launch(ProfileParameters profParms) {
    cl::Buffer* pInBuf = getInput()->
      getDeviceBuffer();
    cl::Buffer* pOutBuf = getOutput()->
      getDeviceBuffer();

    kernel.setArg(0, *pInBuf);
    kernel.setArg(1, *pOutBuf);

    cl::NDRange globalSizes = {NDARRAYWIDTH(getInput()->
      getNDArray(0)) * NDARRAYHEIGHT(getInput()->
      getNDArray(0))};

    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
      globalSizes, cl::NDRange(), NULL, NULL);
  }
}