How can I force Apple's OpenCL compiler to recompile a cached kernel?

I want to use #include statements in my OpenCL kernels but it appears Apple's OpenCL compiler caches kernels, so if you change the contents of an included file but not the file doing the including, the program will not change between runs. I've coded up an example which illustrates this: http://github.com/enjalot/adventures_in_opencl/tree/master/experiments/inc/ If you compile and run, it should work fine. Then if you comment out the struct definition in inc.cl it will still run just fine (or

Why are bitfields not allowed in OpenCL?

Bitfields are not supported in the OpenCL language. What was the reason to not support them? Unlike in other pieces omitted (recursion, pointers to functions, ...), where there is an obvious reason to not support them, I fail to see one for bitfields. I am sure that it is not an oversight on behalf of the committee, but what is the reason? (I store some bits packed in ints, and the code would be nicer to read with them. I understand bitfields as nice syntax for avoiding bit-shifting and masking b

OpenCL: Manually throw an exception in kernel

Is it possible to manually throw an exception in OpenCL, just for debugging purposes? I am having a very strange error in my code: when I compute two double values and add them up, the host reports "CL_OUT_OF_RESOURCE". However, if I don't add these two values, the host doesn't report any error.

Opencl Memory location and allocation

For example, to run an algorithm on an array, we must use a buffer created from that array. But on an Intel/AMD CPU device, global memory is the system's DDR, so the array ends up stored twice. Is there a way to use the array already in memory without allocating a buffer?

Opencl Plotting points in 3D space

I'm looking to plot points in a 3D space. I find that languages like Python with matplotlib take a lot of time. What would be a simple way to do it in OpenCL?

Opencl Error CL_INVALID_PROGRAM_EXECUTABLE when calling clEnqueueNDRangeKernel

I have built a library that calls many OpenCL kernels. All kernels pass the following steps: oclLoadProgSource, clCreateProgramWithSource, clBuildProgram, clCreateKernel. The problem is that when I launch one of those kernels using clEnqueueNDRangeKernel, I get the following error: CL_INVALID_PROGRAM_EXECUTABLE. I know that 5 other kernels have been successfully launched before. When I use the source code directly (not via the library), I do not face this problem at all, and everything works f

Opencl Can you pass const unsigned int4* to a kernel?

I have instructions to use: __kernel void myKernel(__global const unsigned int4* data But I get CL_INVALID_PROGRAM_EXECUTABLE whenever I try to build it. However, both of these build without error: __kernel void myKernel(__global const int4* data __kernel void myKernel(__global const unsigned int* data

Depth Of Field in OpenCL

This might be a "homework" issue, but I think I did enough that I can get help here. In my assignment, we have a working OpenGL/OpenCL application. The OpenGL application renders a scene and OpenCL should apply a depth-of-field-like effect. The OpenCL part gets a texture where each pixel has the original color and depth, and should output the color for a given pixel. I'm supposed to change only the per-pixel function that is part of the OpenCL code. I already have a working solution using a variable-size Gaussian filter, that sa

OpenCL Transfer rate exceed PCI-e Bandwidth

I made an OpenCL program and use pinned memory (CL_MEM_ALLOC_HOST_PTR) to get a higher transfer rate from device to host. The transfer rate increased as I expected (measured using AMD APP Profiler 2.4). The problem is that the transfer rate is higher than the PCIe bandwidth (93703 GB /s) for a 4096 x 4096 matrix (64 MB). It also happens when I use a zero-copy buffer (CL_MEM_ALLOC_HOST_PTR + clEnqueueMapBuffer). I found some information saying that it is true if pinned memory and zero copy buffer ha

Opencl AMD Device Showing CPU Properties in clGetDeviceInfo

I have an AMD W7000 FirePro card installed. When I query its properties, instead of showing its own properties, it shows the same properties as my CPU (Intel Xeon), with the exception of 3 properties, as shown: 1.cl_global_mem_cache_size 2.cl_max_threads_per_block. The way I query properties is that I send the cl_device_id of every device I find to a function get_prop(cl_device_id id), where I simply print all properties using clGetDeviceInfo. PLATFORM 1: NAME : Intel(R) OpenCL

How does a barrier work for an OpenCL kernel?

Kernel code:

    #pragma OPENCL EXTENSION cl_khr_fp64: enable
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void calculate (__global double* in)
    {
        int idx = get_global_id(0); // statement 1
        printf("started for %d workitem\n", idx); // statement 2
        in[idx] = idx + 100; // statement 3
        printf("value changed to %lf in %d workitem\n", in[idx], idx); // statement 4
        barrier(CLK_GLOBAL_MEM_FENCE); // statement 5
        printf("completed for %d workitem\n", idx); // statement 6
    }

Memory transfer between host and device in OpenCL?

Consider the following code, which creates a buffer memory object from an array of doubles of size size: coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, (sizeof(double) * size), arr, &err); Assume it is passed as an argument to a kernel. There are two possibilities, depending on the device on which the kernel is running: (1) the device is the same as the host device; (2) the device is different from the host device. Here are my questions for both possibilities: At what step is the

OpenCL kernel works with certain data types but not others

I'm trying to learn OpenCL, and currently I'm practising making different kernels. In my attempt to make a dot product using reduction methods, I ran into an issue I don't understand. When I run my code with int inputs and output, it works fine. When I change all the int types to float types (for the inputs and outputs) it gives me a result that is close but is slightly off. Can anyone figure out why this is or what is causing it? Here's the host code #include <stdio.h> #include <CL/c

Estimating OpenCL memory access performance for algorithm design

I have a task which I need to achieve using one of several possible algorithms. Each algorithm has its own opportunities for local-memory optimization, and I would like to estimate which algorithm will perform best, based on counting compute operations and memory accesses. For the purpose of comparing different numbers of local memory access operations vs. global memory access operations, I would like to estimate the price (in cycles?) of local memory access (read / write) vs the price of global

Opencl How do I program an INTEL GPU

I am quite new to the world of GPU computing, so I would really like someone to explain the very basics to me. I have two Intel chipsets with the following GPUs: GMA4500 and HD graphics. I am interested in running algebraic and bitwise functions on huge data sets, like transposing an array or bitwise-shifting the lines of an array, on a GPU. The goal is of course to gain more performance. My main question is how can I program such GPUs? In the past I have used CUDA to program on nVIDIA video

OpenCL - write_imagef Compilation failure : CL_INVALID_VALUE

If I run the code below (OpenCL C 1.1, JavaCL RC3) I get the error Compilation failure : CL_INVALID_VALUE. The strange thing about the code is that after replacing the line: write_imagef(output, (int2)(coords.y,coords.x), rnd); with write_imagef(output, (int2)(coords.y,coords.x), pixel); it works perfectly. How can I initialize a float4 struct correctly and assign a value to the output image? __kernel void rotate_image( __read_only image2d_t input, __write_only image2d_t

clCreateContext takes a lot of time in OpenCL

I am working on an OpenCL program, and everything is working well, but the problem is that the clCreateContext function call takes most of the program's execution time (the program runs in about 400 ms, of which 380 ms are spent just creating the context). The kernel is compiled online after creating the context and command queue. My system contains just one OpenCL device (Nvidia Pascal). I tried the same program on an Nvidia GeForce GT 640, and it takes less time to create the context (about 100 ms), but still too l

OpenCL reported device version different between clinfo / clGetDeviceInfo

I'm just trying to dive into OpenCL 2.0. I'm using an AMD R7 260X GPU with AMD APP SDK 3.0 (final) with most current driver (Crimson-something, 2348.4) on Win10-64 with 16GB RAM. Compiler is Visual Studio 2015. First thing I did was querying some information on my system with clInfo. Output was as expected, especially the device OpenCL C Version: Platform Name: AMD Accelerated Parallel Processing Number of devices: 2 Device Type:

Unable to profile OpenCL code using NVidia Visual Profiler

I have an OpenCL code which adds two arrays and prints the output. I want to profile this program using NVidia Visual Profiler that comes with CUDA Toolkit 3.0. I selected the appropriate program(.exe) to profile and the program directory. The profiler runs the code successfully but is unable to generate profiling results. It gives the error "Empty Header found in CSV file". What could be the problem for this? Is it necessary to build the code using NVidia's CUDA compiler to be able to profile?

OpenCL: basic questions about SIMT execution model

Some of the concepts and designs of the "SIMT" architecture are still unclear to me. From what I've seen and read, diverging code paths and if() altogether are a rather bad idea, because many threads might execute in lockstep. Now what does that exactly mean? What about something like: kernel void foo(..., int flag) { if (flag) DO_STUFF else DO_SOMETHING_ELSE } The parameter "flag" is the same for all work units and the same branch is taken for all work units. Now, is

OpenCL local memory size and number of compute units

Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So, one has (Compute Units) x (VertexShaders/compute unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16KB or 32KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters. Now my questions: (a) How can I know the number of mu

Opencl Use GPU and CPU wisely

I'm a newbie to OpenCL and have just started learning. I wanted to know whether it is possible to execute a few threads on the GPU and the remaining threads on the CPU. In other words, if I launch 100 threads and assume that I have an 8-core CPU, is it possible that 8 of the 100 threads will execute on the CPU and the remaining 92 threads will run on the GPU? Can OpenCL help me to do this job smoothly?

Copy float2 values back from GPU in OpenCL

I would like to copy float2 values back to the CPU. The results are correct on the GPU side, but somehow they are incorrect on the CPU side. Can someone please help me? GPU code: #pragma OPENCL EXTENSION cl_amd_printf : enable __kernel void matM(__global float* input, int width, int height, __global float2* output){ int X = get_global_id(0); float2 V; V.x = input [X]; V.y = input [X]; output[X] = V; printf("%f\t %f\n",output[X].x,output[X].y); } CPU code: outp

Is it possible to use more than one property in clCreateCommandQueue in OpenCL?

I am working with OpenCL. My tool does not generate kernel statistics when out-of-order execution mode is enabled, so I decided to enable profiling in clCreateCommandQueue. But then I realized I don't know how to use two properties at the same time. What do I have to do to run in asynchronous (out-of-order execution) mode with profiling enabled?

OpenCL clEnqueueCopyImageToBuffer with stride

I have an OpenCL buffer containing a 2D image. This image has a stride bigger than its width. I need to make an OpenCL image from this buffer. The problem is that the function clEnqueueCopyImageToBuffer does not take a stride as an input parameter. Is it possible to make an OpenCL image from an OpenCL buffer (with a stride bigger than the width) with only one copy, or faster? One way to solve this problem is to write my own kernel, but maybe there are neater solutions?

Opencl AMD GPUs Dynamic Parallelism

Do AMD GPUs support dynamic parallelism? If so, kindly let me know the GPU model details. I know OpenCL 2.0 supports dynamic parallelism. Thanks!

OpenCL: Thread-block algebra for intel HD graphics

I have some background on NVIDIA, and so to learn OpenCL for Intel, I would like to correlate. In case of Nvidia, we have following rules : 1- Warp size: 32 (or in some cases 64) 2- Maximum no. of resident blocks per multiprocessor: 8 3- Maximum no. of threads that can be resident on a Multiprocessor: 768 ( in older cards) 4- Amount of shared memory available per workgroup: 64 KB (48 + 16 KB ) 5- No. of threads per workgroup: 512 (on latest cards it is 1024) 6- A workgroup runs onl

OpenCL: maintaining separate version of kernels

The Intel SDK says: If you need separate versions of kernels, one way to keep the source code base same, is using the preprocessor to create CPU-specific or GPU-specific optimized versions of the kernels. You can run clBuildProgram twice on the same program object, once for CPU with some flag (compiler input) indicating the CPU version, the second time for GPU and corresponding compiler flags. Then, when you create two kernels with clCreateKernel, the runtime has two different ve

Compiling OpenCL 1.2 code on nvidia gpus

I am going to compile some code which needs OpenCL 1.2. As I understand it, nVIDIA has released an OpenCL 1.2 driver. I have installed the latest CUDA Toolkit, which is version 7.0. But when I compiled the code, I got errors like: Error 9 error LNK2001: unresolved external symbol clReleaseDevice C:\Users\???\Downloads\FireRaysSDK-1.0\FireRaysSDK-1.0\App\CLW64.lib(CLWParallelPrimitives.obj) App Error 7 error LNK2001: unresolved external symbol clRetainDevice C:\Users\???\Downloads\

OpenCL Intel Iris Integrated Graphics exits with Abort Trap 6: Timeout Issue

I am attempting to write a program that executes Monte Carlo simulations using OpenCL. I have run into an issue involving exponentials. When the value of the variable steps becomes large, approximately 20000, the calculation of the exponent fails unexpectedly, and the program quits with "Abort Trap: 6". This seems to be a bizarre error given that steps should not affect memory allocation. I have tried setting normal, alpha, and beta to 0 but this does not resolve the problem however commenting o

OpenCL: How would one split an existing buffer into two?

Let's say that I happen to allocate some OpenCL memory as such with 200 float values. cl::Buffer newBuf = cl::Buffer(op::CLManager::getInstance(gpuID)->getContext(), CL_MEM_READ_WRITE, sizeof(float) * 200); Now I would like to split this cl::Buffer into two objects, one with the first 100 float objects, and another with the subsequent, so that I can pass them into two kernels. I can't find any resource that explains how to do this. I have no choice because a library that I am using returns

OpenCL - Storing a large array in private memory

I have a large array of float called source_array with a size of around 50.000. I am currently trying to implement a collection of modifications on the array and evaluate it. Basically in pseudo code: __kernel void doSomething (__global float *source_array, __global boolean *res, __global int *mod_value) { // Modify values of source_array with mod_value; // Evaluate the modified array. } So in the process I would need to have a variable to hold modified array, because source_array

Opencl GPU Driver does not respond after NDRangekernel increase

I am new to OpenCL and I want to parallelise this prime sieve; the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/ I don't get good results out of it: the CPU version is actually much faster in comparison. I tried to use NDRangekernel to avoid writing the nested loops and hopefully increase the performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRangekernel config is not ok, anyone

Does OpenCL itself have FPGA backend for HDL?

I wonder who supports an FPGA HDL backend for OpenCL. I thought that Altera/Intel and Xilinx provide compilers for OpenCL that generate HDL. But does the OpenCL framework itself provide an HDL backend? If I'm right, this is not possible, because FPGAs have unique options depending on which board we use.

Opencl Load SPIR binary with clBuildProgram on Windows

I am trying to load a SPIR binary I created with clang+llvm 6.0.1. Created a few different files with : clang -target spir-unknown-unknown -cl-std=CL1.2 -c -emit-llvm -Xclang -finclude-default-header OCLkernel.cl clang -target amdgcn-amd-amdhsa -cl-std=CL1.2 -c -emit-llvm -Xclang -finclude-default-header OCLkernel.cl clang -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-std=CL1.2 -include "include\opencl-c.h" OCLkernel.cl This is all happening on Windows, installed AMD APP SDK 3 and

Opencl GPU optimization pass in LLVM

Is adding an optimization pass for AMD OpenCL any different from writing an LLVM pass as in Writing an LLVM Pass? What additional knowledge should I have to accomplish this? Do we need some extra libraries to optimize the OpenCL kernel?

OpenCL - how to spawn a separate math process on each core

I am new to OpenCL and I am writing an RSA factoring application. Ideally the application should work across both NV and AMD GPU targets, but I am not finding an easy way to determine the total number of cores/stream procs on each GPU. Is there an easy way to determine how many total cores/stream procs there are on any hardware platform, and then spawn a factoring thread on each available core? The target RSA modulus would be in shared memory, and with each factoring thread using a Rho factor

Difference between abs and abs_diff in OpenCL

This may be a silly question, but I am unable to figure it out. The syntax of abs and abs_diff is: ugentype abs (gentype x) ugentype abs_diff (gentype x, gentype y). Take x = -4 and y = 3: is there any difference between abs(-4-3) and abs_diff(-4,3)? The result of both operations is the same. If I can rewrite abs_diff as abs, then why did Khronos provide two abs functions? Thank you.

OpenCL - Releasing platform object

I was studying the OpenCL release functions ( clRelease(objectName) ) and found it interesting that there is no function to release Platform (more specifically, cl_platform_id) objects. Does anybody know the reason?

What was the real reason why Google is choosing RenderScript instead of OpenCL?

The question has been asked before in a slightly different form, but I'd like to know what Android-developers think what's really behind Google's decision and not what Google's official answer is. OpenCL is an open standard and works on various devices, such as CPUs, desktop GPUs, ARM processors, FPGAs and DSPs. It gives us developers the convenience of creating high performance software and libraries, which works on all devices. RenderScript is a higher level language, which focuses mainly on

Is there a maximum limit to private memory in OpenCL?

Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number? I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.

Data sharing between CPU and GPU on modern x86 hardware with OpenCL or other GPGPU framework

Progressing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th generation CPUs, should allow copy-free sharing of data between CPU and GPU. I would like to know, if the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structure between code running on CPU and GPU.

OpenCL: Kernel Code?

I am working on optimizing an ADAS algorithm which is written in C++. I want to optimize that algorithm using OpenCL. I have gone through some basic OpenCL documentation, and I learned that the kernel code, which does the optimization, is written in C. But I want to know how the kernel internally splits the work into different work-items. How does a single statement do the job of a for loop? Please share your knowledge with me on OpenCL. Tr, Ashwin

OpenCL timeout on beignet doesn't raise error?

I run the following (simplified) code, which runs a simplified kernel for a few seconds, and then checks the results. The first 400,000 or so results are correct, and then the next are all zero. The kernel should put the same value (4228) into each element of the output array of 4.5 million elements. It looks like somehow, somewhere, something is timing out, or not being synchronized, but I'm a bit puzzled, since I: even called clFinish, just to make sure am checking all errors, and no error

OpenCL: Creating image2d_t from float* buffer

I'm trying to learn OpenCL with examples available online. I wrote a HelloWorld matrix multiplication program to multiply float* A & float* B OpenCL device matrices. Now I would like to change float* B to image2d_t B and compare the execution speed with the former HelloWorld code. Is it possible to create an image2d_t object from a float* object in OpenCL? Note: The float* arrays are stored in row-major format. Edit: Adding source code: Based on @Dithermaster's suggestion, I've c

Opencl How to run two work groups per one compute unit on AMD GCN cards

Usually one compute unit can only run one work group. But AMD's doc says there can be more than one wavefront running on the same compute unit. How can I do that? Is there an OpenCL function for that? Or do I need to use assembly instructions? I want to do this because my work group size is 20 and I want to run 2 work groups per compute unit, so that each group can use 32 KiB LDS (64 KiB total per CU, each wavefront can use up to 32KiB so I want to run two wavefronts to use the full amount of LDS).

OpenCL - dynamic shared memory allocation

I am trying to translate some existing CUDA kernels to OpenCL and the problem is that I am bound to use OpenCL 1.2, so it is not possible to use non-uniform work-group size, meaning that I should let enqueueNDRangeKernel decide local work-group size(to avoid non-divisible workgroup size w.r.t. global work size). As it is mentioned in this presentation, I use __local int * which is an argument of kernel function as shared memory pointer with the size that is defined in the host code using the &l

OpenCL using uchar* instead of image2d_t

First of all... I am no expert in OpenCL. I am using 2 kernels. The output of the first kernel is image2d_t but the input of the second kernel is " __global const uchar* source". __kernel void firstKernel(__read_only image2d_t input, __write_only image2d_t output) {...} __kernel void secondKernel( __global const uchar* source,...) {...} How to use the second kernel with that input?
