OpenCL


Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task-based and data-based parallelism.

OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. Conformant implementations are available from Altera, AMD, Apple, ARM, Creative, IBM, Imagination, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS.[8][9]

Contents

  • 1 Overview
    • 1.1 Memory hierarchy
  • 2 OpenCL C language
    • 2.1 Example: matrix-vector multiplication
    • 2.2 Example: computing the FFT
  • 3 History
    • 3.1 OpenCL 1.0
    • 3.2 OpenCL 1.1
    • 3.3 OpenCL 1.2
    • 3.4 OpenCL 2.0
    • 3.5 OpenCL 2.1
    • 3.6 OpenCL 2.2
  • 4 Implementations
    • 4.1 Timeline of vendor implementations
  • 5 Devices
    • 5.1 Conformant products
  • 6 Extensions
    • 6.1 Device fission
  • 7 Portability, performance and alternatives
  • 8 See also
  • 9 References
  • 10 External links

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or "accelerators" such as graphics processing units (GPUs), attached to a host processor (a CPU). It defines a C-like language for writing programs. Functions executed on an OpenCL device are called "kernels".[1]:17 A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel. How a compute device is subdivided into compute units and PEs is up to the vendor; a compute unit can be thought of as a "core", but the notion of core is hard to define across all the types of devices supported by OpenCL (or even within the category of "CPUs"),[10]:49–50 and the number of compute units may not correspond to the number of cores claimed in vendors' marketing literature (which may actually be counting SIMD lanes).[11]

In addition to its C-like programming language, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.[12] The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python,[13] Java and .NET.[10]:15 An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute devices targeted.

In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation (SPIR)[14] can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.

More recently Khronos Group has ratified SYCL,[15] a higher-level programming model for OpenCL as a single-source DSEL based on pure C++14 to improve programming productivity.

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device:[12]

  • global memory: shared by all processing elements, but has high access latency;
  • read-only memory: smaller, low latency, writable by the host CPU but not the compute devices;
  • local memory: shared by a group of processing elements;
  • per-element private memory (registers).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.

Devices may or may not share memory with the host CPU.[12] The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.

OpenCL C language

The programming language that is used to write compute kernels is called OpenCL C and is based on C99,[16] but adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program. Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden.[17] The C standard library is replaced by a custom set of standard functions, geared toward math programming.

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.[17] In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types.[16]:§ 6.1.2 Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs.[12] Other specialized types include 2-d and 3-d image types.[16]:10–11
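The relationship between work-items and work-groups can be illustrated with a small plain-C model. This is only a sketch: the struct and function names are hypothetical and not part of OpenCL, and it assumes a one-dimensional range with a zero global offset. It mirrors the values that the OpenCL C built-ins get_global_id(0), get_group_id(0) and get_local_id(0) would report for one work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C model of a 1-D range; illustrative only, not the OpenCL API. */
typedef struct {
    size_t group_id;  /* what get_group_id(0) would return */
    size_t local_id;  /* what get_local_id(0) would return */
} work_item_pos;

/* Split a global id into (work-group, position inside the group). */
work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos pos = { global_id / local_size, global_id % local_size };
    return pos;
}

/* Inverse mapping, assuming a zero global offset. */
size_t global_id_of(work_item_pos pos, size_t local_size)
{
    return pos.group_id * local_size + pos.local_id;
}
```

For example, with a work-group size of 64, the work-item with global id 130 is the third item (local id 2) of the third work-group (group id 2).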

Example: matrix-vector multiplication

Each invocation (work-item) of the kernel takes a row of the matrix A, multiplies this row with the vector x and places the result in an entry of the vector y. The number of columns n is passed to the kernel as ncols; the number of rows is implicit in the number of work-items produced by the host program.

The following is a matrix-vector multiplication algorithm in OpenCL C:

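    // Multiplies A*x, leaving the result in y.
    // A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
    __kernel void matvec(__global const float *A, __global const float *x,
                         uint ncols, __global float *y)
    {
        size_t i = get_global_id(0);            // global id, used as the row index
        __global const float *a = &A[i*ncols];  // pointer to the i'th row
        float sum = 0.f;                        // accumulator for the dot product
        for (size_t j = 0; j < ncols; j++)
            sum += a[j] * x[j];
        y[i] = sum;
    }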

The kernel function matvec computes, in each invocation, the dot product of a single row of a matrix A and a vector x:

$$y_i = a_{i,:} \cdot x = \sum_j a_{i,j} x_j$$

To extend this into a full matrix-vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix. On the host side, the clEnqueueNDRangeKernel function does this; it takes as arguments the kernel to execute, its arguments, and a number of work-items, corresponding to the number of rows in the matrix A.
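The mapping of one kernel invocation per row can be emulated in plain C without an OpenCL runtime. The sketch below is an assumption-laden stand-in, not how an implementation schedules work: a sequential loop plays the role of the runtime, and the row index is passed as a parameter instead of being read with get_global_id(0). The names matvec_work_item and matvec_emulated are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item: the dot product of row i of A with x.
   In OpenCL C, i would come from get_global_id(0); here it is a parameter. */
static void matvec_work_item(size_t i, const float *A, const float *x,
                             unsigned ncols, float *y)
{
    const float *a = &A[i * ncols];  /* start of row i (row-major layout) */
    float sum = 0.0f;
    for (unsigned j = 0; j < ncols; j++)
        sum += a[j] * x[j];
    y[i] = sum;
}

/* Sequential stand-in for launching nrows work-items. */
void matvec_emulated(const float *A, const float *x,
                     size_t nrows, unsigned ncols, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < nrows; i++)
        matvec_work_item(i, A, x, ncols, y);
}
```

Because each work-item writes only its own entry y[i] and reads shared inputs, the iterations are independent, which is what allows a real runtime to execute them in parallel.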

Example: computing the FFT

This example will load a fast Fourier transform (FFT) implementation and execute it. The implementation is shown below.[18] The code asks the OpenCL library for the first available graphics card, creates memory buffers for reading and writing (from the perspective of the graphics card), JIT-compiles the FFT kernel and then finally asynchronously runs the kernel. The result from the transform is not read in this example.

    // create a compute context with GPU device
    context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // create a command queue
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, NULL);
    queue = clCreateCommandQueue(context, device_id, 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL, NULL);

    // create the compute program
    program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

    // build the compute program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // choose the work-item dimensions (needed below to size the local buffers)
    global_work_size[0] = num_entries;
    local_work_size[0] = 64; // Nvidia: 192 or 256

    // set the args values
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

    // create N-D range object with work-item dimensions and execute kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
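The launch geometry implied by the host code can be checked with a little arithmetic. The helpers below are a hypothetical plain-C sketch (their names are not OpenCL API). They compute how many work-groups a one-dimensional launch produces and how many bytes the host requests for each __local buffer, using the same expression as kernel arguments 2 and 3 above. The sketch assumes the global size is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D range. Assumes the global size is a
   multiple of the local size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Bytes requested for each __local buffer (kernel arguments 2 and 3),
   mirroring sizeof(float)*(local_work_size[0]+1)*16 in the host code. */
size_t local_buffer_bytes(size_t local_size)
{
    return sizeof(float) * (local_size + 1) * 16;
}
```

With an illustrative global size of 1024 and the local size of 64 used above, this gives 16 work-groups, and each local buffer holds (64 + 1) × 16 = 1040 floats, or 4160 bytes where a float occupies 4 bytes.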

The actual calculation is based on Fitting FFT onto the G80 Architecture;[19] the kernel is declared as follows:

    // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
    // calls to a radix 16 function, another radix 16 function and then a radix 4 function
    __kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                             __local float *sMemx, __local float *sMemy)

A full, open source implementation of an OpenCL FFT can be found on Apple's website.[20]

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed[21] with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008.[22] This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.[23]

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release:[24]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework.[25][26] RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface.[27] On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit.[28] On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers.[29]

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010[30] and adds functionality for enhanced parallel programming flexibility and performance, including:

  • New data types including 3-component vectors and additional image formats;
  • Handling commands from multiple host threads and processing buffers across multiple devices;
  • Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;
  • Enhanced use of events to drive and control command execution;
  • Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
  • Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification,[31] which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

  • Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
  • Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
  • Enhanced image support: OpenCL 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
  • Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.
  • DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.
  • The ability to force IEEE 754 compliance for single precision floating point math: OpenCL by default allows the single precision versions of the division, reciprocal, and square root operations to be less accurate than the correctly rounded values that IEEE 754 requires.[32] If the programmer passes the "-cl-fp32-correctly-rounded-divide-sqrt" command line argument to the compiler, these three operations will be computed to IEEE 754 requirements if the OpenCL implementation supports this, and will fail to compile if the OpenCL implementation does not support computing these operations to their correctly-rounded values as defined by the IEEE 754 specification.[32] This ability is supplemented by the ability to query the OpenCL implementation to determine if it can perform these operations to IEEE 754 accuracy.[32]

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification.[33] Updates and additions to OpenCL 2.0 include:

  • Shared virtual memory
  • Nested parallelism
  • Generic address space
  • Images
  • C11 atomics
  • Pipes
  • Android installable client driver extension

OpenCL 2.1

The ratification and release of the OpenCL 2.1 provisional specification was announced on March 3, 2015 at the Game Developers Conference in San Francisco. It was released on November 16, 2015.[34] It replaces the OpenCL C kernel language with OpenCL C++, a subset of C++14. Vulkan and OpenCL 2.1 share SPIR-V as an intermediate representation, allowing high-level language front-ends to share a common compilation target. Updates to the OpenCL API include:

  • Additional subgroup functionality
  • Copying of kernel objects and states
  • Low-latency device timer queries
  • Ingestion of SPIR-V code by runtime
  • Execution priority hints for queues
  • Zero-sized dispatches from host

AMD, ARM, Intel, HPC, and YetiWare have declared support for OpenCL 2.1.[35][36]

OpenCL 2.2

OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel programming productivity:[37][38][39]

  • The OpenCL C++ kernel language is a static subset of the C++14 standard and includes classes, templates, lambda expressions, function overloads and many other constructs for generic and meta-programming.
  • Leverages the new Khronos SPIR-V 1.1 intermediate language, which fully supports the OpenCL C++ kernel language.
  • OpenCL library functions can now take advantage of the C++ language to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
  • Pipe storage is a new device-side type in OpenCL 2.2 that is useful for FPGA implementations by making connectivity size and type known at compile time, enabling efficient device-scope communication between kernels.
  • OpenCL 2.2 also includes features for enhanced optimization of generated code: applications can provide the value of a specialization constant at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
  • Runs on any OpenCL 2.0-capable hardware; only a driver update is required.

Implementations

OpenCL consists of a set of headers and a shared object that is loaded at runtime. An installable client driver (ICD) must be installed on the platform for every vendor whose devices the runtime needs to support. For example, to support Nvidia devices on a Linux platform, the Nvidia ICD must be installed so that the OpenCL runtime (the ICD loader) can locate the ICD for that vendor and redirect calls appropriately. The standard OpenCL header is used by the consumer application; calls to each function are then proxied by the OpenCL runtime to the appropriate driver using the ICD. Each vendor must implement each OpenCL call in their driver.[40]
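The proxying described above can be pictured with a heavily simplified plain-C model. Everything in this sketch is hypothetical: the registry, the function names and the lookup by vendor name are illustrative assumptions, not the mechanism defined by the ICD specification. It shows only the general idea that the application calls one entry point and the loader forwards the call to whichever vendor driver has registered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an ICD loader. A real loader dispatches every
   OpenCL entry point; this one dispatches a single call. */
typedef struct {
    const char *vendor;         /* name the driver registered under */
    int (*device_count)(void);  /* the vendor's implementation of one call */
} vendor_icd;

#define MAX_ICDS 8
static vendor_icd registry[MAX_ICDS];
static size_t registry_len = 0;

/* Called once per installed vendor driver. Returns 0 on success, -1 if full. */
int icd_register(const char *vendor, int (*device_count)(void))
{
    assert(vendor != NULL && device_count != NULL);
    if (registry_len == MAX_ICDS)
        return -1;
    registry[registry_len].vendor = vendor;
    registry[registry_len].device_count = device_count;
    registry_len++;
    return 0;
}

/* Application-facing call: find the vendor's ICD and forward to it.
   Returns -1 if no ICD for that vendor is installed. */
int loader_device_count(const char *vendor)
{
    for (size_t i = 0; i < registry_len; i++)
        if (strcmp(registry[i].vendor, vendor) == 0)
            return registry[i].device_count();
    return -1;
}

/* Two stand-in "drivers" for the usage example. */
int vendor_a_device_count(void) { return 2; }
int vendor_b_device_count(void) { return 1; }
```

After registering both stand-in drivers with icd_register, a call such as loader_device_count("VendorA") is answered by the first driver, while a request for an unregistered vendor returns -1, much as an application sees no devices from a vendor whose ICD is not installed.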

A number of open source implementations of the OpenCL ICD exist, including freeocl[41][42] and ocl-icd.[43]

MESA Gallium Compute: An implementation of OpenCL for a number of platforms is maintained as part of the Gallium Compute Project,[44] which builds on the work of the Mesa project to support multiple platforms. Formerly this was known as CLOVER.[45] Support for OpenCL 1.0, 1.1 and 1.2 is in progress, mostly for AMD Radeon hardware, but the implementation still fails many of the Khronos conformance tests.

BEIGNET: An implementation by Intel for its Ivy Bridge+ hardware was released in 2013.[46] This software, developed by Intel's China team and called "Beignet", is not based on Mesa/Gallium, which has attracted criticism from developers at AMD and Red Hat,[47] as well as Michael Larabel of Phoronix.[48] The current version, 1.2.1, supports OpenCL 1.2 on Ivy Bridge and later hardware.[49][50] Support for OpenCL 2.0 is in development. Beignet can also be used on Android.[51]

POCL: A CPU-only version building on Clang and LLVM, called pocl, is intended to be a portable OpenCL implementation.[52][53] The current version is 0.13, which has some bugs, incomplete support for OpenCL 1.2 and partial support for 2.0.[54][55]

Shamrock is a port of Mesa Clover for ARM.[56][57] It currently supports OpenCL 1.2 and targets 2.0.

ROCm: A new open source Linux compute project is Radeon Open Compute (ROCm), for Radeon Graphics GCN 3 and 4 (Hawaii, Fiji, Polaris) and Intel Gen7.5+ CPUs (Haswell+) or new AMD Ryzen with PCIe 3.0. OpenCL 1.2 and a basic part of 2.0 are supported.[58]

Timeline of vendor implementations

  • December 10, 2008: AMD and Nvidia held the first public OpenCL demonstration, a 75-minute presentation at Siggraph Asia 2008. AMD showed a CPU-accelerated OpenCL demo explaining the scalability of OpenCL on one or more cores, while Nvidia showed a GPU-accelerated demo.[59][60]
  • March 16, 2009: at the 4th Multicore Expo, Imagination Technologies announced the PowerVR SGX543MP, the first GPU of this company to feature OpenCL support.[61]
  • March 26, 2009: at GDC 2009, AMD and Havok demonstrated the first working implementation for OpenCL accelerating Havok Cloth on AMD Radeon HD 4000 series GPU.[62]
  • April 20, 2009: Nvidia announced the release of its OpenCL driver and SDK to developers participating in its OpenCL Early Access Program.[63]
  • August 5, 2009: AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program.[64]
  • August 28, 2009: Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL.[65]
    OpenCL in Snow Leopard is supported on the Nvidia GeForce 320M, GeForce GT 330M, GeForce 9400M, GeForce 9600M GT, GeForce 8600M GT, GeForce GT 120, GeForce GT 130, GeForce GTX 285, GeForce 8800 GT, GeForce 8800 GS, Quadro FX 4800, Quadro FX 5600, ATI Radeon HD 4670, ATI Radeon HD 4850, Radeon HD 4870, ATI Radeon HD 5670, ATI Radeon HD 5750, ATI Radeon HD 5770 and ATI Radeon HD 5870.[66]
  • September 28, 2009: Nvidia released its own OpenCL drivers and SDK implementation.
  • October 13, 2009: AMD released the fourth beta of the ATI Stream SDK 2.0, which provides a complete OpenCL implementation on both R700/R800 GPUs and SSE3 capable CPUs. The SDK is available for both Linux and Windows.[67]
  • November 26, 2009: Nvidia released drivers for OpenCL 1.0 (rev 48).
    The Apple,[68] Nvidia,[69] RapidMind[70] and Gallium3D[71] implementations of OpenCL are all based on the LLVM compiler technology and use the Clang compiler as their frontend.
  • October 27, 2009: S3 released their first product supporting native OpenCL 1.0 – the Chrome 5400E embedded graphics processor.[72]
  • December 10, 2009: VIA released their first product supporting OpenCL 1.0 – ChromotionHD 2.0 video processor included in VN1000 chipset.[73]
  • December 21, 2009: AMD released the production version of the ATI Stream SDK 2.0,[74] which provides OpenCL 1.0 support for R800 GPUs and beta support for R700 GPUs.
  • June 1, 2010: ZiiLABS released details of their first OpenCL implementation for the ZMS processor for handheld, embedded and digital home products.[75]
  • June 30, 2010: IBM released a fully conformant version of OpenCL 1.0.[76]
  • September 13, 2010: Intel released details of their first OpenCL implementation for the Sandy Bridge chip architecture. Sandy Bridge will integrate Intel's newest graphics chip technology directly onto the central processing unit.[77]
  • November 15, 2010: Wolfram Research released Mathematica 8 with OpenCLLink package.
  • March 3, 2011: Khronos Group announces the formation of the WebCL working group to explore defining a JavaScript binding to OpenCL. This creates the potential to harness GPU and multi-core CPU parallel processing from a Web browser.[78][79]
  • March 31, 2011: IBM released a fully conformant version of OpenCL 1.1.[76][80]
  • April 25, 2011: IBM released OpenCL Common Runtime v0.1 for Linux on x86 Architecture.[81]
  • May 4, 2011: Nokia Research releases an open source WebCL extension for the Firefox web browser, providing a JavaScript binding to OpenCL.[82]
  • July 1, 2011: Samsung Electronics releases an open source prototype implementation of WebCL for WebKit, providing a JavaScript binding to OpenCL.[83]
  • August 8, 2011: AMD released the OpenCL-driven AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) v2.5, replacing the ATI Stream SDK as technology and concept.[84]
  • December 12, 2011: AMD released AMD APP SDK v2.6,[85] which contains a preview of OpenCL 1.2.
  • February 27, 2012: The Portland Group released the PGI OpenCL compiler for multi-core ARM CPUs.[86]
  • April 17, 2012: Khronos released a WebCL working draft.[87]
  • May 6, 2013: Altera released the Altera SDK for OpenCL, version 13.0.[88] It is conformant to OpenCL 1.0.[89]
  • November 18, 2013: Khronos announced that the specification for OpenCL 2.0 had been finalized.[90]
  • March 19, 2014: Khronos releases the WebCL 1.0 specification.[91][92]
  • August 29, 2014: Intel releases HD Graphics 5300 driver that supports OpenCL 2.0.[93]
  • September 25, 2014: AMD releases Catalyst 14.41 RC1, which includes an OpenCL 2.0 driver.[94]
  • April 13, 2015: Nvidia releases WHQL driver v350.12, which includes OpenCL 1.2 support for GPUs based on Kepler or later architectures.[95] Driver 340+ supports OpenCL 1.1 for Tesla and Fermi.
  • August 26, 2015: AMD released AMD APP SDK v3.0,[96] which contains full support of OpenCL 2.0 and sample coding.
  • November 16, 2015: Khronos announced that the specification for OpenCL 2.1 had been finalized.[97]
  • April 18, 2016: Khronos announced that the specification for OpenCL 2.2 had been provisionally finalized.[38]

Devices

As of 2016, OpenCL runs on graphics processing units, CPUs with SIMD instructions, FPGAs, Movidius Myriad 2, Adapteva Epiphany and DSPs.

Conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products.[5]

Synopsis of OpenCL conformant products:[5]

  • AMD APP SDK (supports OpenCL CPU and accelerated processing unit devices; GPU: Terascale 1: OpenCL 1.1, Terascale 2: 1.2, GCN 1: 1.2+, GCN 2+: 2.0+): X86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit;[98] Linux 2.6 PC, Windows Vista/7/8.x/10 PC; AMD Fusion E-350, E-240, C-50, C-30 with HD 6310/HD 6250; AMD Radeon/Mobility HD 6800, HD 5x00 series GPU, iGPU HD 6310/HD 6250, HD 7xxx, HD 8xxx, R2xx, R3xx, RX 4xx; AMD FirePro Vx800 series GPU and later, Radeon Pro.
  • Intel SDK for OpenCL Applications 2013[99] (supports Intel Core processors and Intel HD Graphics 4000/2500): Intel CPUs with SSE 4.1, SSE 4.2 or AVX support;[100][101] Microsoft Windows, Linux; Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3, 3rd Generation Intel Core Processors with Intel HD Graphics 4000/2500; Intel Core 2 Solo, Duo, Quad, Extreme; Intel Xeon 7x00, 5x00, 3x00 (Core based).
  • IBM Servers with OpenCL Development Kit for Linux on Power running on Power VSX:[102][103] IBM Power 755 (PERCS), 750; IBM BladeCenter PS70x Express; IBM BladeCenter JS2x, JS43; IBM BladeCenter QS22.
  • IBM OpenCL Common Runtime (OCR):[104] X86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit;[105] Linux 2.6 PC; AMD Fusion, Nvidia Ion and Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3; AMD Radeon, Nvidia GeForce and Intel Core 2 Solo, Duo, Quad, Extreme; ATI FirePro, Nvidia Quadro and Intel Xeon 7x00, 5x00, 3x00 (Core based).
  • Nvidia OpenCL Driver and Tools[106] (chips: Tesla, Fermi: OpenCL 1.1 with Driver 340+; Kepler, Maxwell, Pascal: OpenCL 1.2 with Driver 370+): Nvidia Tesla C/D/S; Nvidia GeForce GTS/GT/GTX, Nvidia Ion; Nvidia Quadro FX/NVX/Plex, Quadro, Quadro K, Quadro M, Quadro P.

Extensions

Some vendors provide extended functionality over the standard OpenCL specification by means of extensions. These are still specified by Khronos but provided by vendors within their SDKs. They often contain features that are to be implemented in the future; for example, device fission functionality was originally an extension but is now provided as part of the 1.2 specification.

Extensions provided in the 1.2 specification include:

  • Writing to 3D image memory objects
  • Half-precision floating-point format
  • Sharing memory objects with OpenGL
  • Creating event objects from GL sync objects
  • Sharing memory objects with Direct3D 10
  • DX9 media Surface Sharing
  • Sharing Memory Objects with Direct3D 11

Device fission

Device fission, introduced fully into the OpenCL standard with version 1.2, allows individual command queues to be used for specific areas of a device. For example, within the Intel SDK, a command queue can be created that maps directly to an individual core. AMD also provides functionality for device fission, also originally as an extension. Device fission can be used where compute capacity must be reliably available, such as in a latency-sensitive environment. Fission effectively reserves areas of the device for computation.
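As a rough illustration of what reserving areas of a device means, the plain-C sketch below splits a device's compute units into equally sized sub-devices. It is loosely modelled on the idea of equal partitioning and makes no claim to match any vendor's behavior: the sub_device struct and partition_equally function are hypothetical, and the choice to leave leftover compute units unassigned is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a sub-device as a range of compute units. */
typedef struct {
    unsigned first_unit;  /* index of the first compute unit in the range */
    unsigned unit_count;  /* number of compute units in the range */
} sub_device;

/* Split total_units into as many sub-devices of units_each as fit.
   Leftover units stay unassigned. Writes at most max_out entries to out
   and returns the number written. */
size_t partition_equally(unsigned total_units, unsigned units_each,
                         sub_device *out, size_t max_out)
{
    assert(units_each > 0 && out != NULL);
    size_t n = 0;
    for (unsigned first = 0;
         first + units_each <= total_units && n < max_out;
         first += units_each) {
        out[n].first_unit = first;
        out[n].unit_count = units_each;
        n++;
    }
    return n;
}
```

For a device with 8 compute units and sub-devices of 3 units each, this yields two sub-devices covering units 0 to 2 and 3 to 5. A command queue bound to one of them would then compete for compute units only with other work sent to that same sub-device.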

Portability, performance and alternatives

A key feature of OpenCL is portability, via its abstracted memory and execution model, and the programmer is not able to directly use hardware-specific technologies such as inline Parallel Thread Execution (PTX) for Nvidia GPUs unless they are willing to give up direct portability on other platforms. It is possible to run any OpenCL kernel on any conformant implementation.

However, performance of the kernel is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem,[107] yielding "acceptable levels of performance" in experimental linear algebra kernels.[108] Portability of an entire application containing multiple kernels with differing behaviors was also studied, and the results showed that portability required only limited tradeoffs.[109]

A study at Delft University that compared CUDA programs and their straightforward translation into OpenCL C found CUDA to outperform OpenCL by at most 30% on the Nvidia implementation. The researchers noted that their comparison could be made fairer by applying manual optimizations to the OpenCL programs, in which case there was "no reason for OpenCL to obtain worse performance than CUDA". The performance differences could mostly be attributed to differences in the programming model (especially the memory model) and to Nvidia's compiler optimizations for CUDA compared to those for OpenCL.[107]

Another study, at D-Wave Systems Inc., found that "The OpenCL kernel’s performance is between about 13% and 63% slower, and the end-to-end time is between about 16% and 67% slower" than CUDA's performance.[110]

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices.[111] This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support vector machines trained on compile-time features of a program that can decide the device partitioning problem statically, without actually running the programs to measure their performance.[112]
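To make the partitioning problem concrete, here is a deliberately simple baseline in plain C. It is not the machine-learning method of Grewe and O'Boyle; it only divides the work-items in proportion to throughputs that are assumed to have been measured beforehand. The function name gpu_share and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work-items assigned to the GPU when total_items are divided in proportion
   to measured throughputs (items per second). The CPU receives the rest.
   A simple heuristic, not the machine-learning approach cited above. */
size_t gpu_share(size_t total_items, double cpu_rate, double gpu_rate)
{
    assert(cpu_rate >= 0.0 && gpu_rate >= 0.0 && cpu_rate + gpu_rate > 0.0);
    double fraction = gpu_rate / (cpu_rate + gpu_rate);
    size_t share = (size_t)((double)total_items * fraction);
    return share > total_items ? total_items : share;  /* guard against rounding */
}
```

If the GPU processes items three times as fast as the CPU, 1000 work-items are split into 750 for the GPU and 250 for the CPU. The weakness of such a baseline is that the rates must be measured by running the program, which is the cost a static predictor tries to avoid.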

See also

  • Advanced Simulation Library
  • AMD FireStream
  • BrookGPU
  • C++ AMP
  • Close to Metal
  • CUDA
  • DirectCompute
  • GPGPU
  • Larrabee
  • Lib Sh
  • List of OpenCL applications
  • OpenACC
  • OpenGL
  • OpenHMPP
  • OpenMP
  • Metal
  • Renderscript
  • SequenceL
  • SIMD
  • Vulkan

References

  1. Howes, Lee (November 11, 2015). "The OpenCL Specification Version: 2.1 Document Revision: 23" (PDF). Khronos OpenCL Working Group. Retrieved November 16, 2015.
  2. Bourd, Alex (11 March 2016). "The OpenCL Specification Version: 2.2 Document Revision: 06" (PDF). Khronos OpenCL Working Group. Retrieved 29 April 2016.
  3. "Android Devices With OpenCL support". Google Docs. ArrayFire. Retrieved April 28, 2015.
  4. "FreeBSD Graphics/OpenCL". FreeBSD. Retrieved 23 December 2015.
  5. "Conformant Products". Khronos Group. Retrieved May 9, 2015.
  6. Munshi, Aaftab; Howes, Lee; Sochaki, Barosz (13 April 2016). "The OpenCL C Specification Version: 2.0 Document Revision: 33" (PDF). Khronos OpenCL Working Group. Retrieved 29 April 2016.
  7. Munshi, Aaftab (March 2, 2015). "The OpenCL C++ Specification Version: 1.0 Document Revision: 08" (PDF). Khronos OpenCL Working Group. Retrieved April 16, 2015.
  8. "Conformant Companies". Khronos Group. Retrieved April 8, 2015.
  9. Gianelli, Silvia E. (January 14, 2015). "Xilinx SDAccel Development Environment for OpenCL, C, and C++, Achieves Khronos Conformance". PR Newswire. Xilinx. Retrieved April 27, 2015.
  10. Gaster, Benedict; Howes, Lee; Kaeli, David R.; Mistry, Perhaad; Schaa, Dana (2012). Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Morgan Kaufmann.
  11. Tompson, Jonathan; Schlachter, Kristofer (2012). "An Introduction to the OpenCL Programming Model" (PDF). New York University Media Research Lab. Retrieved July 6, 2015.
  12. Stone, John E.; Gohara, David; Shi, Guochin (2010). "OpenCL: a parallel programming standard for heterogeneous computing systems". Computing in Science & Engineering. doi:10.1109/MCSE.2010.69.
  13. Klöckner, Andreas; Pinto, Nicolas; Lee, Yunsup; Catanzaro, Bryan; Ivanov, Paul; Fasih, Ahmed (2012). "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation". Parallel Computing. 38 (3): 157–174. arXiv:0911.3456. doi:10.1016/j.parco.2011.09.001.
  14. https://www.khronos.org/spir/
  15. https://www.khronos.org/sycl/
  16. Aaftab Munshi, ed. (2014). "The OpenCL C Specification, Version 2.0" (PDF). Retrieved June 24, 2014.
  17. AMD. Introduction to OpenCL Programming 201005, page 89-90. Archived May 16, 2011, at the Wayback Machine.
  18. "OpenCL" (PDF). SIGGRAPH2008. August 14, 2008. Retrieved August 14, 2008.
  19. "Fitting FFT onto G80 Architecture" (PDF). Vasily Volkov and Brian Kazian, UC Berkeley CS258 project report. May 2008. Retrieved November 14, 2008.
  20. "OpenCL on FFT". Apple. November 16, 2009. Retrieved December 7, 2009.
  21. ^ "Khronos Launches Heterogeneous Computing Initiative" Press release Khronos Group June 16, 2008 Retrieved June 18, 2008 
  22. ^ "OpenCL gets touted in Texas" MacWorld November 20, 2008 Retrieved June 12, 2009 
  23. ^ "The Khronos Group Releases OpenCL 10 Specification" Press release Khronos Group December 8, 2008 Retrieved December 4, 2016 
  24. ^ "Apple Previews Mac OS X Snow Leopard to Developers" Press release Apple Inc June 9, 2008 Retrieved June 9, 2008 
  25. ^ "AMD Drives Adoption of Industry Standards in GPGPU Software Development" Press release AMD August 6, 2008 Retrieved August 14, 2008 
  26. ^ "AMD Backs OpenCL, Microsoft DirectX 11" eWeek August 6, 2008 Retrieved August 14, 2008 
  27. ^ "HPCWire: RapidMind Embraces Open Source and Standards Projects" HPCWire November 10, 2008 Archived from the original on December 18, 2008 Retrieved November 11, 2008 
  28. ^ "Nvidia Adds OpenCL To Its Industry Leading GPU Computing Toolkit" Press release Nvidia December 9, 2008 Retrieved December 10, 2008 
  29. ^ "OpenCL Development Kit for Linux on Power" alphaWorks October 30, 2009 Retrieved October 30, 2009 
  30. ^ "Khronos Drives Momentum of Parallel Computing Standard with Release of OpenCL 1.1 Specification" Retrieved 2016-02-24 
  31. ^ "Khronos Releases OpenCL 1.2 Specification" Khronos Group November 15, 2011 Retrieved June 23, 2015 
  32. ^ a b c "OpenCL 1.2 Specification" PDF Khronos Group Retrieved June 23, 2015 
  33. ^ "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing" Khronos Group November 18, 2013 Retrieved February 10, 2014 
  34. ^ "Khronos Releases OpenCL 2.1 and SPIR-V 1.0 Specifications for Heterogeneous Parallel Programming" Khronos Group November 16, 2015 Retrieved November 16, 2015 
  35. ^ "Khronos Announces OpenCL 2.1: C++ Comes to OpenCL" AnandTech March 3, 2015 Retrieved April 8, 2015 
  36. ^ "Khronos Releases OpenCL 2.1 Provisional Specification for Public Review" Khronos Group March 3, 2015 Retrieved April 8, 2015 
  37. ^ https://www.khronos.org/opencl/
  38. ^ a b https://www.khronos.org/news/press/khronos-releases-opencl-22-provisional-spec-opencl-c-kernel-language
  39. ^ Trevett, Neil April 2016 "OpenCL – A State of the Union" PDF IWOCL.org Vienna: Khronos Group Retrieved 2017-01-02 
  40. ^ "OpenCL ICD Specification" Retrieved June 23, 2015 
  41. ^ "freeocl – Multi-platform implementation of OpenCL 1.2 targeting CPUs" code.google.com Retrieved June 23, 2015 
  42. ^ https://github.com/zuzuf/freeocl
  43. ^ "OpenCL ICD Loader" forge.imag.fr Retrieved June 23, 2015 
  44. ^ "GalliumCompute" dri.freedesktop.org Retrieved June 23, 2015 
  45. ^ https://www.x.org/wiki/Events/XDC2013/XDC2013TomStellardCloverStatus/XDC2013TomStellardCloverStatus.pdf
  46. ^ Michael Larabel January 10, 2013 "Beignet: OpenCL/GPGPU Comes For Ivy Bridge On Linux" Phoronix 
  47. ^ Michael Larabel April 16, 2013 "More Criticism Comes Towards Intel's Beignet OpenCL" Phoronix 
  48. ^ Michael Larabel December 24, 2013 "Intel's Beignet OpenCL Is Still Slowly Baking" Phoronix 
  49. ^ https://freedesktop.org/wiki/Software/Beignet/
  50. ^ https://cgit.freedesktop.org/beignet/
  51. ^ https://www.phoronix.com/scan.php?page=news_item&px=Intel-Beignet-Android
  52. ^ Jääskeläinen, Pekka; Sánchez de La Lama, Carlos; Schnetter, Erik; Raiskila, Kalle; Takala, Jarmo; Berg, Heikki 2014 "pocl: A Performance-Portable OpenCL Implementation" Int'l J Parallel Programming doi:10.1007/s10766-014-0320-y 
  53. ^ https://tutcris.tut.fi/portal/files/5075042/pocl.pdf
  54. ^ http://portablecl.org/pocl-0.13.html
  55. ^ http://portablecl.org/docs/html/features.html
  56. ^ https://git.linaro.org/gpgpu/shamrock.git/about/
  57. ^ https://s3.amazonaws.com/connect.linaro.org/lca14/presentations/LCA14-412-%20GPGPU%20on%20ARM%20SoC%20session.pdf
  58. ^ https://github.com/RadeonOpenCompute/ROCm
  59. ^ "OpenCL Demo, AMD CPU" December 10, 2008 Retrieved March 28, 2009 
  60. ^ "OpenCL Demo, Nvidia GPU" December 10, 2008 Retrieved March 28, 2009 
  61. ^ "Imagination Technologies launches advanced, highly-efficient POWERVR SGX543MP multi-processor graphics IP family" Imagination Technologies March 19, 2009 Retrieved January 30, 2011 
  62. ^ "AMD and Havok demo OpenCL accelerated physics" PC Perspective March 26, 2009 Archived from the original on April 5, 2009 Retrieved March 28, 2009 
  63. ^ "Nvidia Releases OpenCL Driver To Developers" Nvidia April 20, 2009 Retrieved April 27, 2009 
  64. ^ "AMD does reverse GPGPU, announces OpenCL SDK for x86" Ars Technica August 5, 2009 Retrieved August 6, 2009 
  65. ^ Dan Moren; Jason Snell June 8, 2009 "Live Update: WWDC 2009 Keynote" macworld.com MacWorld Retrieved June 12, 2009 
  66. ^ "Mac OS X Snow Leopard – Technical specifications and system requirements" Apple Inc March 23, 2011 Retrieved March 23, 2011 
  67. ^ "ATI Stream Software Development Kit SDK v2.0 Beta Program" Archived from the original on August 9, 2009 Retrieved October 14, 2009 
  68. ^ "Apple entry on LLVM Users page" Retrieved August 29, 2009 
  69. ^ "Nvidia entry on LLVM Users page" Retrieved August 6, 2009 
  70. ^ "Rapidmind entry on LLVM Users page" Retrieved October 1, 2009 
  71. ^ "Zack Rusin's blog post about the Gallium3D OpenCL implementation" Retrieved October 1, 2009 
  72. ^ "S3 Graphics launched the Chrome 5400E embedded graphics processor" Archived from the original on December 2, 2009 Retrieved October 27, 2009 
  73. ^ "VIA Brings Enhanced VN1000 Graphics Processor" Retrieved December 10, 2009 
  74. ^ "ATI Stream SDK v2.0 with OpenCL 1.0 Support" Retrieved October 23, 2009 
  75. ^ "OpenCL" ZiiLABS Retrieved June 23, 2015 
  76. ^ a b "Khronos Group Conformant Products" 
  77. ^ "Intel discloses new Sandy Bridge technical details" Retrieved September 13, 2010 
  78. ^ "WebCL related stories" Khronos Group Retrieved June 23, 2015 
  79. ^ "Khronos Releases Final WebGL 1.0 Specification" Khronos Group Retrieved June 23, 2015 
  80. ^ "OpenCL Development Kit for Linux on Power" 
  81. ^ "About the OpenCL Common Runtime for Linux on x86 Architecture" 
  82. ^ "Nokia Research releases WebCL prototype" Khronos Group May 4, 2011 Retrieved June 23, 2015 
  83. ^ SharathKamathK "Samsung's WebCL Prototype for WebKit" Github.com Retrieved June 23, 2015 
  84. ^ "AMD Opens the Throttle on APU Performance with Updated OpenCL Software Development " Amd.com August 8, 2011 Retrieved June 16, 2013 
  85. ^ "AMD APP SDK v2.6" Forums.amd.com March 13, 2015 Retrieved June 23, 2015 
  86. ^ "The Portland Group Announces OpenCL Compiler for ST-Ericsson ARM-Based NovaThor SoCs" Retrieved May 4, 2012 
  87. ^ "WebCL Latest Spec" cvs.khronos.org November 7, 2013 Retrieved June 23, 2015 
  88. ^ "Altera Opens the World of FPGAs to Software Programmers with Broad Availability of SDK and Off-the-Shelf Boards for OpenCL" Altera.com Retrieved January 9, 2014 
  89. ^ "Altera SDK for OpenCL is First in Industry to Achieve Khronos Conformance for FPGAs" Altera.com Retrieved January 9, 2014 
  90. ^ "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing" Khronos Group November 18, 2013 Retrieved June 23, 2015 
  91. ^ "WebCL 1.0 Press Release" Khronos Group March 19, 2014 Retrieved June 23, 2015 
  92. ^ "WebCL 1.0 Specification" Khronos Group March 14, 2014 Retrieved June 23, 2015 
  93. ^ Intel OpenCL 2.0 Driver
  94. ^ "AMD OpenCL 2.0 Driver" support.amd.com June 17, 2015 Retrieved June 23, 2015 
  95. ^ "Release 349 Graphics Drivers for Windows, Version 350.12" PDF April 13, 2015 Retrieved February 4, 2016 
  96. ^ "AMD APP SDK 3.0 Released" developer.amd.com August 26, 2015 Retrieved September 11, 2015 
  97. ^ https://www.khronos.org/news/press/khronos-releases-opencl-21-and-spir-v-10-specifications-for-heterogeneous
  98. ^ "OpenCL and the AMD APP SDK" AMD Developer Central developer.amd.com Archived from the original on August 4, 2011 Retrieved August 11, 2011 
  99. ^ "About Intel OpenCL SDK 1.1" software.intel.com intel.com Retrieved August 11, 2011 
  100. ^ "Product Support" Retrieved August 11, 2011 
  101. ^ "Intel OpenCL SDK – Release Notes" Archived from the original on July 17, 2011 Retrieved August 11, 2011 
  102. ^ "Announcing OpenCL Development Kit for Linux on Power v0.3" Retrieved August 11, 2011 
  103. ^ "IBM releases OpenCL Development Kit for Linux on Power v0.3 – OpenCL 1.1 conformant release available" OpenCL Lounge ibm.com Retrieved August 11, 2011 
  104. ^ "IBM releases OpenCL Common Runtime for Linux on x86 Architecture" Retrieved September 10, 2011 
  105. ^ "OpenCL and the AMD APP SDK" AMD Developer Central developer.amd.com Archived from the original on September 6, 2011 Retrieved September 10, 2011 
  106. ^ "Nvidia Releases OpenCL Driver" Retrieved August 11, 2011 
  107. ^ a b Fang, Jianbin; Varbanescu, Ana Lucia; Sips, Henk 2011 A Comprehensive Performance Comparison of CUDA and OpenCL PDF Proc Int'l Conf on Parallel Processing doi:10.1109/ICPP.2011.45 
  108. ^ Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack 2012 "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming" Parallel Computing 38 (8): 391–407 doi:10.1016/j.parco.2011.10.002 
  109. ^ Romain Dolbeau; François Bodin; Guillaume Colin de Verdière September 7, 2013 "One OpenCL to rule them all" Archived from the original on January 16, 2014 Retrieved January 14, 2014 
  110. ^ Karimi, Kamran; Dickson, Neil G; Hamze, Firas 2011 "A Performance Comparison of CUDA and OpenCL" arXiv:1005.2581v3 
  111. ^ A Survey of CPU-GPU Heterogeneous Computing Techniques, ACM Computing Surveys, 2015
  112. ^ Grewe, Dominik; O'Boyle, Michael F P 2011 A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL Proc Int'l Conf on Compiler Construction doi:10.1007/978-3-642-19861-8_16 

External links

  • Official website for OpenCL
  • Official website for WebCL
  • Annual OpenCL Conference sponsored by The Khronos Group

