“We Are at the Dawn of an Era That Transcends CPU and GPU Computing”


Open Source Computer Vision (OpenCV) is a widely used and heavily documented image processing and computer vision library. With the latest version having participation from big players like Intel and AMD, let us find out about the Open Computing Language (OpenCL) acceleration layer added to it.

Dr. Harris Gasparakis, OpenCV manager, Computer Vision, AMD, speaks to Priya Ravindran from EFY about graphical processing and AMD’s contribution to OpenCV. Take a look.

Q. To set a base for this interview, how is graphical processing unit (GPU) computing different from central processing unit (CPU) computing when not handling graphics?
A. GPU computing is built with parallelism in mind. CPUs have multiple cores and support hyper-threading, which allows more than one executing thread per core (two threads per CPU core is a typical number). So, a retail CPU may have two to four cores, each core running one or two threads. To contrast this with GPUs: nominally, an integrated GPU will have eight cores and a discrete GPU will have 40 cores, and each GPU core keeps many threads in flight at once. Clearly, you have massively more concurrent threads on a GPU than on a CPU! To compensate for that, CPUs support vectorised instructions, such as streaming SIMD extensions (SSE) and advanced vector extensions (AVX), which are typically accessed through compiler intrinsics. However, taking advantage of the vector capabilities is a non-trivial task, and it is far from automatic. In fact, the modern trend, embodied by GPUs, is to employ scalar architectures, support multiple concurrent threads and let the compiler and the runtime do their magic!

Q. How in your opinion could computing evolve from here?
A. It is worth mentioning here that we are at the dawn of an era that transcends CPU and GPU computing by combining both. I am talking about heterogeneous computing, where CPUs and GPUs are not only integrated on the same die, but also efficiently synchronise access to, and operate on, the same data in main memory, without redundant data copies. OpenCL 2.0 provides the application programming interface (API) that enables this.

Q. What are the main points to keep in mind while creating an OpenCL implementation?
A. There are two main elements in creating an OpenCL implementation or port of any library. First, you need to make sure that your data makes it to your processing cores. For example, on a discrete GPU, you need to explicitly copy the data from main memory to on-board memory. That kind of approach will also work on an integrated device, but will not be efficient. The other element in porting a library to OpenCL is that you need to port the processing logic to OpenCL kernel syntax.

Q. What are the main differences between OpenCV-without-OpenCL and OpenCV-with-OpenCL? What new capabilities does it offer?
A. It was not about new functionality; rather, it was about accelerating the most common existing functionality to take advantage of GPUs. As for getting data to the GPU, on an integrated GPU you would want to use “zero copy”, i.e., you want the GPU to use the same data as the CPU, rather than copying it to on-board memory. This was possible using OpenCL 1.2, but it is significantly easier with OpenCL 2.0. In OpenCV, our goal was to create an implementation that would work on all OpenCL capable devices, such as integrated or discrete GPUs. As part of OpenCV’s extensive automated testing, we make sure that the OpenCL functionality works correctly on a wide gamut of devices. The other element is porting the processing logic. As far as embedded programming goes, OpenCL is a high-level, C-like language (and stay tuned for C++ goodness in the future), so that is not very difficult in principle. In practice, we wanted to have a high-performance implementation, so we fine-tuned most of the OpenCL kernels for performance.

Q. Why did you choose to go with the transparent acceleration layer? How difficult was it to integrate this into OpenCV?
A. We introduced an OpenCL module in OpenCV 2.4. While we kept the names of most functions the same, taking advantage of OpenCL acceleration back then required ‘porting’ the code to use data structures and functions of the OpenCL module (under the ‘ocl’ namespace). While this was not difficult, it was a step I felt that we could do without! The goal of the transparent acceleration layer (T-API) was to enable people to write their code only once: if an OpenCL capable device is available, it will be used; otherwise, the fallback is CPU execution, which can also include accelerators like Integrated Performance Primitives (IPP) or intrinsics like AVX/SSE. Detection of OpenCL devices happens dynamically, at runtime. Integration inside OpenCV was a significant effort. We sponsored the maintainers of OpenCV; they became very excited about the vision and carried it forward.
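
The write-once idea described above can be pictured with a small sketch. This is a toy model in plain Python, not OpenCV code: `Device`, `discover_devices`, `opencl_invert`, `cpu_invert` and `invert` are all hypothetical names, and the "OpenCL" path simply repeats the CPU arithmetic where a real library would launch a kernel. It only illustrates a single entry point that picks a backend at runtime.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Device:
    """Hypothetical stand-in for an OpenCL-capable device found at runtime."""
    name: str


def discover_devices(available: Sequence[str]) -> List[Device]:
    """Pretend runtime probe; a real library would query the OpenCL platform."""
    return [Device(name) for name in available]


def cpu_invert(pixels: Sequence[int]) -> List[int]:
    """CPU fallback path: invert 8-bit pixel values."""
    return [255 - p for p in pixels]


def opencl_invert(pixels: Sequence[int], device: Device) -> List[int]:
    """Placeholder for the accelerated path (assumption: same result as CPU).

    A real implementation would enqueue a kernel on `device`; here the
    arithmetic is repeated so the sketch stays runnable anywhere.
    """
    return [255 - p for p in pixels]


def invert(pixels: Sequence[int], devices: Sequence[Device]) -> Tuple[List[int], str]:
    """Single entry point: the caller's code is identical on every machine."""
    if devices:
        device = devices[0]
        return opencl_invert(pixels, device), f"opencl:{device.name}"
    return cpu_invert(pixels), "cpu"


if __name__ == "__main__":
    print(invert([0, 100, 255], discover_devices([])))
    print(invert([0, 100, 255], discover_devices(["gpu0"])))
```

The caller never names a backend; only the device list found at runtime decides which path runs.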

Q. Could you tell us how this interface works?
A. Certainly. My main idea was to introduce a new data structure, the ‘universal’ or ‘unified’ matrix (UMat), to replace the ‘Mat’ data structure, which was historically the basic image data structure in OpenCV. The goal was to hide data locality under the UMat hood, in a way that is appropriate for each class of device. For example, in the case of discrete GPUs, UMat would have the responsibility to synchronise data between the CPU and the GPU, by doing a copy. At the other extreme, for platforms supporting fine-grain shared virtual memory (SVM), an OpenCL 2.0 construct, the CPU and GPU can use exactly the same memory pointer (doing a memory map operation if needed). In the case where OpenCL is not available (or not enabled), UMat just acts as a Mat. Therefore, code that uses UMat has nothing to lose (it will work on CPUs) and potentially has very much to gain! As part of this effort, OpenCL moved into the core module and the 2.4-style ocl namespace got buried under the hood, so people do not need to use it anymore.
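
To make the three cases concrete, here is a toy model of a wrapper that hides data locality. It is a sketch under stated assumptions, not how UMat is implemented: `Platform`, `UnifiedMatrix`, `compute_buffer` and the `copies` counter are hypothetical names, and Python lists stand in for host and on-board memory.

```python
from enum import Enum
from typing import List, Optional, Sequence


class Platform(Enum):
    """Hypothetical device classes, mirroring the three cases in the text."""
    NO_OPENCL = "no_opencl"          # wrapper behaves like a plain matrix
    DISCRETE_GPU = "discrete_gpu"    # separate on-board memory, needs a copy
    SHARED_MEMORY = "shared_memory"  # e.g. fine-grain SVM, same pointer


class UnifiedMatrix:
    """Toy model of a UMat-like wrapper; not OpenCV's implementation."""

    def __init__(self, data: Sequence[int], platform: Platform) -> None:
        self.platform = platform
        self.host: List[int] = list(data)          # CPU-side storage
        self.device: Optional[List[int]] = None    # stand-in for GPU memory
        self.device_stale = True                   # device copy out of date?
        self.copies = 0                            # host-to-device copies made

    def write(self, index: int, value: int) -> None:
        """CPU-side write; any separate device copy becomes stale."""
        self.host[index] = value
        self.device_stale = True

    def compute_buffer(self) -> List[int]:
        """Return the buffer the compute device would read from."""
        if self.platform is Platform.DISCRETE_GPU:
            if self.device_stale:
                # Synchronise by copying, as a discrete GPU requires.
                self.device = list(self.host)
                self.copies += 1
                self.device_stale = False
            return self.device
        # Shared memory: CPU and GPU see the same storage (zero copy).
        # No OpenCL: processing stays on the CPU, so the host data is used.
        return self.host
```

In this model, calling `compute_buffer()` twice on a discrete-GPU matrix copies only once unless `write()` is called in between, while the shared-memory and no-OpenCL cases never copy at all.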

Q. To what effect can embedded engineers use OpenCV?
A. OpenCV, traditionally, has been great for prototyping, given the wealth of algorithms that have been implemented. OpenCV 3.0 brought about a general re-architecture, further strengthening the library by organising it in a more modular fashion. In the past, it was difficult to attain real-time performance on embedded CPUs. However, with AMD’s embedded product lines that feature cost-effective, yet powerful integrated GPUs, it makes great sense to use them!

Q. Have you seen any changes in either the adoption of OpenCV or its applications since the transparent API was introduced? Has there been any problem during implementation?
A. The OpenCV community is very excited about the recent release of 3.0. It is expected that, over time, 3.0 will replace previous generations (e.g., 2.4), and it is strongly encouraged to use UMats instead of Mats. That is all it takes to get OpenCL acceleration in OpenCV!
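
As a rough illustration of that switch, the sketch below uses OpenCV's Python bindings, where `cv2.UMat` wraps a NumPy array and `get()` copies the result back. It assumes the `cv2` and `numpy` packages are installed; the guarded import, the helper names `find_edges` and `demo`, the synthetic test image and the filter parameters are all choices made for this example, not something taken from the interview.

```python
try:
    import cv2
    import numpy as np
except ImportError:  # OpenCV or NumPy not installed; the demo is skipped
    cv2 = None


def find_edges(image):
    """Blur then run Canny on a UMat, so OpenCL is used when available."""
    u_image = cv2.UMat(image)                        # wrap the ndarray
    blurred = cv2.GaussianBlur(u_image, (5, 5), 1.5)  # result is a UMat
    edges = cv2.Canny(blurred, 50, 150)              # still a UMat
    return edges.get()                               # back to an ndarray


def demo():
    """Run find_edges on a synthetic image; returns None without OpenCV."""
    if cv2 is None:
        return None
    image = np.zeros((64, 64), dtype=np.uint8)
    image[16:48, 16:48] = 255  # white square on a black background
    return find_edges(image)


if __name__ == "__main__":
    result = demo()
    if result is None:
        print("OpenCV not installed; nothing to run")
    else:
        print("OpenCL available:", cv2.ocl.haveOpenCL())
        print("edge map shape:", result.shape)
```

The same function calls accept a plain NumPy array in place of the `UMat`, in which case the work runs on the CPU and no `get()` is needed.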

Q. How far has OpenCL penetrated mobile and embedded systems?
A. The mobile space has unique requirements, where OpenCL may be less suited, or at least is less publicised, compared to other technologies such as NEON intrinsics or RenderScript. On high-end embedded systems, traditionally, vendors would employ specialised hardware solutions like digital signal processors (DSPs) or field-programmable gate arrays (FPGAs). However, embedded vendors now realise that there are credible, high-performance and relatively low-cost embedded solutions using GPUs. When weighing cost, one has to factor in the total cost of ownership, as developing in a generic language like OpenCL is arguably significantly cheaper than developing for a specialised processor.

Q. What would be the next step forward for OpenCV?
A. The next step in OpenCV will be to take full advantage of the advanced features of OpenCL 2.0. I am particularly excited about the prospect of extensively using fine-grain SVM, as this would enable efficient hybrid imaging pipelines, where one would mix and match CPU and GPU execution, with no performance penalty!

The author is a Technical Journalist at EFY.

