CUDA: feasibility of the GPU as a CPU?

What do you think the future is for GPU-as-a-CPU initiatives like CUDA? Do you think they are going to become mainstream and be the next adopted fad in the industry? Apple is building a new framework for using the GPU to do CPU tasks, and there has been a lot of success with Nvidia's CUDA project in the sciences. Would you suggest that a student commit time to this field?

CUDA: copying data allocated on the device from device to host

I have a pointer that is dynamically allocated on the device; how can I copy the data it points to from the device to the host?

#include <stdio.h>

#define cudaSafeCall(call){ \
    cudaError err = call; \
    if(cudaSuccess != err){ \
        fprintf(stderr, "%s(%i) : %s.\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    }}

#define cudaCheckErr(errorMessage) { \
    cudaError_t err = cudaGetLastError(); \
    if(cudaSuccess != err){ \
        fprintf(stderr, "%
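
A hedged sketch of one common pattern, assuming the buffer itself was allocated with cudaMalloc and only the pointer to it is stored in device memory (d_pp, n, and the function name are made up): first copy the pointer value back to the host, then copy the data it points to. If the memory was instead allocated with in-kernel malloc(), it lives on the device heap and, as far as I know, cannot be read with cudaMemcpy at all; a kernel would first have to stage it into cudaMalloc'd memory.

/* d_pp: device location holding an int* (set by a kernel); n: number of ints it points to */
void fetch_to_host(int **d_pp, size_t n, int *h_data)
{
    int *h_devptr = NULL;   /* receives the device address stored at d_pp */
    cudaMemcpy(&h_devptr, d_pp, sizeof(int *), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_data, h_devptr, n * sizeof(int), cudaMemcpyDeviceToHost);
}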

Calculate Thread ID in CUDA

I wrote my code using one block of size 8*8. I use this formula to define the indices of a matrix: int idx = blockIdx.x * blockDim.x + threadIdx.x; int idy = blockIdx.y * blockDim.y + threadIdx.y; To check it, I put idx and idy into 1D arrays so I can copy them to the host and print them out. if (idx<N && idy<N) { c[idx]=idx; d[idx]=idy; }//end if The strange thing is that idy always gives me 3! Can anyone help me resolve this?
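
For what it's worth, a minimal sketch of flattening the 2D thread index into a unique 1D offset, so that threads with the same idx but different idy do not write to the same slot; it assumes an N x N matrix stored row by row and is illustrative rather than the poster's actual code:

__global__ void fill(int *c, int *d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int idy = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (idx < N && idy < N) {
        int tid = idy * N + idx;   // unique linear index per (idx, idy) pair
        c[tid] = idx;
        d[tid] = idy;
    }
}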

Using CUDA in Ubuntu 11.10

I'm starting development of CUDA-based tools on Ubuntu and tried to install/use the SDK. However, deviceQuery gives:

CUDA driver version is insufficient for CUDA runtime version

For reference:

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Tue_Oct_18_17:35:10_PDT_2011
Cuda compilation tools, release 4.1, V0.2.1221

# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 280.13 Wed Jul 27 16:53:56 PDT 2011 GC

CUDA-NPP sample code

Can anyone tell me how to compile the boxFilter program found on the CUDA-NPP sample code site? 'make' gives an error about common_npplib.mk - I can't find common_npplib.mk, but it is included in the makefile. Anyway, I tried this:

g++ -I../../common/UtilNPP -I../../../shared/inc -I../../common/FreeImage -I/usr/global/cuda/4.0/cuda/include -L/usr/global/cuda/4.0/cuda/lib64 -L../../common/FreeImage/lib/linux -L../../../shared/lib -lnpp -lcudart -lUtilNPP_x86_64 -lfreeimage64 -o bf boxFilterNPP.

CUDA: Why do GPU-based algorithms perform faster?

I just implemented an algorithm on the GPU that computes the difference between consecutive indices of an array. I compared it with a CPU-based implementation and noticed that for large arrays, the GPU-based implementation performs faster. I am curious WHY the GPU-based implementation performs faster. I know the surface reasoning that a GPU has several cores and can thus do the operation in parallel, i.e., instead of visiting each index sequentially, we can assign a thread
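
For reference, a minimal sketch of the kind of adjacent-difference kernel described above (names and launch parameters are illustrative, not the poster's actual code):

// out[i] = in[i+1] - in[i]; one thread per output element
__global__ void adjacent_diff(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        out[i] = in[i + 1] - in[i];
}

// launch (illustrative): adjacent_diff<<<(n + 255) / 256, 256>>>(d_in, d_out, n);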

stack smashing detected in bandwidthTest.cu in CUDA SDK

I want to run the bandwidthTest from the CUDA SDK, but it terminates with a "stack smashing detected" error. How can I solve this problem? I use the make command to build and run this program, and I cannot change anything inside the code.

cudaMemcpy & blocking

I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that Fermi hardware can simultaneously execute kernels and perform a cudaMemcpy. I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further execution until the copy has fully completed? OR does it mean the copy won't start until the previous kernels have finished? e.g., does this code provide the same blocking operation? SomeCudaCall<<<25,34>>
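
As a point of comparison, a minimal sketch of the asynchronous variant: cudaMemcpyAsync returns control to the host immediately and is only ordered after prior work in the same stream. This assumes pinned host memory; the kernel name and its argument are made up.

__global__ void some_kernel(float *buf);   // made-up kernel

void copy_async_example(int N)
{
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));   // pinned host memory, needed for truly asynchronous copies
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    some_kernel<<<25, 34, 0, stream>>>(d_buf);        // kernel enqueued in the stream
    cudaMemcpyAsync(h_buf, d_buf, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);  // ordered after the kernel, but the host is not blocked
    cudaStreamSynchronize(stream);                    // block only when the result is actually needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}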

CUDA sincospi function precision

I have looked all over and I couldn't find how the function computes or uses its PI part. For my project I am using a defined constant for PI with a precision of 34 decimal places. However, this is much more precise than the normal math.h constant for PI, which has 16 decimal places. My question is: how precise a value of PI does sincospi use to compute its answer? Is it just using the PI constant from math.h?
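
For illustration, a minimal sketch of how the device function is typically called; the point of sincospi(x) is that it evaluates sin(pi*x) and cos(pi*x) directly, so no user-supplied PI constant enters the call (kernel and buffer names here are made up):

__global__ void sincospi_demo(const float *x, float *s, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sv, cv;
        sincospif(x[i], &sv, &cv);   // sin(pi * x[i]) and cos(pi * x[i]) in one call
        s[i] = sv;
        c[i] = cv;
    }
}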

How to use L2 Cache in CUDA

I have searched other threads on the usage of the L2 cache in CUDA, but have been unable to find a solution. How do I make use of the L2 cache? Is there any invoking function or declaration for its use? For example, to use shared memory we use __device__ __shared__. Is there anything like that for the L2 cache?
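
As far as I know there is no source-level qualifier for L2 the way __shared__ works for shared memory; the L2 cache is used automatically for global memory traffic. The closest knob I can point to (a hedged, Fermi/Kepler-era suggestion) is the ptxas cache-modifier flag, which chooses whether global loads are cached in L1+L2 or in L2 only:

# cache global loads in both L1 and L2 (the usual default)
nvcc -Xptxas -dlcm=ca -o app app.cu

# cache global loads in L2 only, bypassing L1
nvcc -Xptxas -dlcm=cg -o app app.cu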

CUDA: modifying the contents of a texture's CUDA array on the fly

I want to periodically modify the contents of a CUDA array to which I have a texture reference in the device code. Note that the update of the array is to be done in host code. My question is: can we do this concurrently, i.e., the device kernel is invoked only once while the array contents change periodically, with the changes reflected in device memory?

CUDA: Why does changing a kernel parameter deplete my resources?

I made a very simple kernel below to practice CUDA.

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import cv2

def compile_kernel(kernel_code, kernel_name):
    mod = SourceModule(kernel_code)
    func = mod.get_function(kernel_name)
    return func

input_file = np.array(cv2.imread('clouds.jpg'))
height, width, channels = np.int32(input_file.shape)

my_kernel_code = """
__global__ void my_kernel(int width,

CUDA: different declarations between the SDK deviceQuery sample and the occupancy calculator?

At the moment I am trying to get better occupancy for my kernel, using the occupancy calculator and the device information I get from the SDK sample deviceQuery. I'm wondering about a slightly different declaration of blocks and streaming multiprocessors (SMs). In the SDK sample the values are called "total amount of shared memory per block" and "total number of registers available per block", but in the occupancy calculator this information is per SM, which makes more sense to me. Is that only

CUDA 5 and CMake

I have been working on getting some of the simple CUDA 5.0 samples to work on Windows 7 with CMake. I'm running Windows 7 64-bit and Visual C++ Express 2010. I had this working the other day and now it fails to compile, and I'm not quite sure why. I generally do my coding in *nix environments, so Windows development environments aren't a strong area for me. If anyone can help save me from pulling out more hair, it'd be greatly appreciated. The example is just the simple vectorAdd.cu file that

Separate compilation in CUDA

System specs: laptop with Nvidia Optimus support (GeForce 740M, supports compute capability 2.0), Ubuntu 13.10, CUDA 5.0, optirun (Bumblebee) 3.2.1. I'm trying to compile and run a simpler version of the example described here:

main.cu
#include "defines.h"
#include <cuda.h>
int main () {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
}

defines.h
#include <cuda.h>
extern __global__ void hello(void);

defines.cu
#include <cstdio>
#include <cuda.h>
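
For reference, a hedged sketch of the usual build steps for separate compilation of device code (relocatable device code plus a device link step); the output name is made up and the architecture flag should match the actual GPU:

# compile each .cu with relocatable device code
nvcc -arch=sm_20 -dc main.cu defines.cu

# let nvcc perform the device link and produce the executable
nvcc -arch=sm_20 main.o defines.o -o hello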

CUDA - more SM or higher clock rate?

What matters more when CUDA kernel execution speed is of vital importance: the frequency of the cores or the number of SMs? I can choose between a Quadro K5000 and a GTX 670 and I cannot decide. Memory seems sufficient in both cases, but the Quadro has more SMs while the GTX has a higher clock rate (I suppose this value is per core).

CUDA: mirror reordering in Thrust

I'm using a thrust vector and I'm looking for an elegant method for reordering a thrust device vector using a "mirror" ordering (an example is given below; I couldn't find any function for this in Thrust). For instance, let's say my vector contains a struct, and each struct contains several numbers. My vector looks like the following: [1,2] [5,4] [-2,5] [6,1] [2,6] After the mirror reordering operation I'd like to receive the following vector (the 1st element switched with the n-th element) (th
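
If "mirror ordering" simply means reversing the vector (first element swapped with the last, and so on), a minimal sketch using a Thrust primitive could look like this; the element type is a stand-in for the poster's struct:

#include <thrust/device_vector.h>
#include <thrust/reverse.h>

struct Item { int a, b; };   // stand-in for the poster's struct

void mirror(thrust::device_vector<Item>& v)
{
    // in-place: swaps v[0] with v[n-1], v[1] with v[n-2], ...
    thrust::reverse(v.begin(), v.end());
}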

CUDA: coalesced global memory access vs. using shared memory

If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored? If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there not be any improvement? i.e., if each thread is reading the next 5 or 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement s
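
To make the trade-off concrete, a minimal sketch of a shared-memory variant of the "average the next K elements" pattern described above; the block size, K, and all names are illustrative:

#define K 10        // how many following elements each thread averages
#define BLOCK 256

__global__ void avg_next_k(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + K];   // the block's elements plus a halo of K

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread loads one element; the first K threads also load the halo
    if (gid < n)
        tile[threadIdx.x] = in[gid];
    if (threadIdx.x < K && gid + BLOCK < n)
        tile[threadIdx.x + BLOCK] = in[gid + BLOCK];
    __syncthreads();

    if (gid + K < n) {
        float sum = 0.0f;
        for (int j = 0; j < K; ++j)     // neighbours now come from shared memory
            sum += tile[threadIdx.x + j];
        out[gid] = sum / K;
    }
}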

CUDA: how to resize a device_vector after using unique_by_key?

This is the code I used, but it does not work; something is wrong with new_end:

thrust::device_vector<int> keys;
thrust::device_vector<int> values;
// after initialization.
pair<int*, int*> new_end;
new_end = thrust::unique_by_key(keys.begin(), keys.end(), values.begin());
keys.resize(thrust::distance(keys.begin,new_end.first));
values.resize(thrust::distance(values.begin(), new_end.right));
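
A hedged sketch of how I would expect the types to line up: on device_vector iterators, unique_by_key returns a thrust::pair of device_vector iterators (not raw int*), the second member is .second (not .right), and keys.begin needs parentheses. Assuming keys and values are the device_vectors from the question:

#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <thrust/distance.h>
#include <thrust/pair.h>

void compact(thrust::device_vector<int>& keys, thrust::device_vector<int>& values)
{
    typedef thrust::device_vector<int>::iterator DVI;

    thrust::pair<DVI, DVI> new_end =
        thrust::unique_by_key(keys.begin(), keys.end(), values.begin());

    keys.resize(thrust::distance(keys.begin(), new_end.first));
    values.resize(thrust::distance(values.begin(), new_end.second));
}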

CUDA: CUSPARSE internal format conversion in csrmv/csrmm

I'm using CUSPARSE functions to perform sparse matrix-vector/matrix-matrix multiplications. The sparse matrices are stored in CSR format. While profiling the app under the Visual Profiler, I've noticed that for each call to cusparse(S/D)csrmv or cusparse(S/D)csrmm there is a memory allocation/memset/copy. Judging by the kernel names in the profiler, it looks like CUSPARSE converts the matrix from CSR format to HYB format on each call, which is a waste of time in my case as I could create the matrix in the rig

CUDA Constant Memory Best Practices

I present here some code:

__constant__ int array[1024];

__global__ void kernel1(int *d_dst) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    d_dst[tId] = array[tId];
}

__global__ void kernel2(int *d_dst, int *d_src) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    d_dst[tId] = d_src[tId];
}

int main(int argc, char **argv) {
    int *d_array;
    int *d_src;
    cudaMalloc((void**)&d_array, sizeof(int) * 1024);
    cudaMalloc((void**)&d_src, sizeof(int) * 1024);
    int *te
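
The snippet is cut off before the actual question, so only a hedged general note: __constant__ memory is optimized for the case where every thread in a warp reads the same address (a broadcast), whereas per-thread indexing like array[tId] in kernel1 serializes the accesses. An illustrative broadcast-style access pattern:

__constant__ float coeffs[16];

__global__ void broadcast_use(const float *x, float *y, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < 16; ++k)
        acc += coeffs[k] * x[tid];   // every thread in the warp reads the same coeffs[k]: a broadcast
    y[tid] = acc;
}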

Profiling CUDA code: Unexpected instruction counts on coalesced memory reads

I am profiling a very dumb sorting algorithm for small input data (512 elements). I am invoking a kernel that reads, in a coalesced fashion, from an array of structs. The struct looks like this:

struct __align__(8) Elements {
    float weight;
    int value;
};

nvprof delivers the following instruction counts for L1 misses/hits and gld instructions:

Invocations   Avg   Min   Max   Event Name
Kernel: sort(Elements*)
500   0   0

CUDA: random number generation vs. hashing inside a kernel

Concerning random number generation, I have the following choices:
1- Generate random numbers on the GPU and use them in the kernel
2- Generate random numbers on the CPU and send them to the kernel via PCI-e
3- Generate random numbers using a hashing function written inside the kernel
How do I decide which is the best one? Any general guidelines?
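
As a reference point for option 1, a minimal sketch of the cuRAND device API (per-thread state initialized once, then drawn from inside a kernel); the seed, sizes, and names are illustrative:

#include <curand_kernel.h>

__global__ void setup_rng(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);   // one RNG state per thread
}

__global__ void use_rng(curandState *states, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        curandState local = states[id];
        out[id] = curand_uniform(&local);    // uniform float in (0, 1]
        states[id] = local;                  // save the state for the next call
    }
}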

CUDA: solving collisions - trying to coalesce global memory access using shared memory, but getting bank conflicts

I have this code:

struct __declspec(align(32)) Circle
{
    float x, y;
    float prevX, prevY;
    float speedX, speedY;
    float mass;
    float radius;

    void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
              const float _radius = CIRCLE_RADIUS_DEFAULT, const float _mass = CIRCLE_MASS_DEFAULT);
};

And the second one:

/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + thread

CUDA: Thrust zip_iterator - is a typedef essential?

I tried to do this:

thrust::zip_iterator<IteratorTuple> zip;
zip = make_zip_iterator(...)

That failed to compile, but when I did this:

typedef thrust::zip_iterator<IteratorTuple> ZipIterator;
ZipIterator zip = make_zip_iterator(...)

my code compiled and did exactly what I wanted. My question is, why was the typedef required in this case? And is this usage of typedef specific to this context? I can post the rest of my code if somebody thinks the problem might have been elsewher
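
For comparison, a hedged sketch of the pattern I usually see, where the zip_iterator is initialized at the point of declaration (or, under C++11, with auto so the type never has to be spelled out); the vectors are made up:

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

void example()
{
    thrust::device_vector<int>   a(100);
    thrust::device_vector<float> b(100);

    // initialized directly rather than default-constructed and assigned later
    auto zip_begin = thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin()));
    auto zip_end   = thrust::make_zip_iterator(thrust::make_tuple(a.end(),   b.end()));
}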

CUDA: how to use the Thrust min_element algorithm without memcpys between device and host

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device. Using Nvidia's Visual Profiler, it appears that whenever I call thrust::min_element, there is a DtoH (device-to-host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small D
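
One alternative I'm aware of that leaves the result in device memory is CUB's DeviceReduce::ArgMin, which writes a key-value pair (index and value of the minimum) into a device buffer; this is a hedged sketch, not a drop-in replacement for the Thrust call, and the names are made up:

#include <cub/cub.cuh>

// d_in: device array of n floats; d_argmin: caller-allocated device buffer for the result
void device_argmin(const float *d_in, int n, cub::KeyValuePair<int, float> *d_argmin)
{
    void  *d_temp = NULL;
    size_t temp_bytes = 0;
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_argmin, n);   // first call: query temp storage size
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_argmin, n);   // second call: index and value stay on the device
    cudaFree(d_temp);
}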

A simple code example about CUDA warps

I have read in the CUDA documentation that, inside each block, threads are executed in batches of 32 called warps: each thread points at the same instruction but can access different data. I wanted to test the authenticity of this statement. What I did was launch a kernel with 256 threads and a single block, so 8 warps must be executed. I create a shared variable of size 32, assign it as sharedVariable[threadIdx.x % 32] = threadIdx.x / 32; and then assign th

CUDA: CUFFT error in file

I am receiving the error "Cufft error in file". I am using this file in order to load the FFT and pass it to another file.

//----function to check for errors-------------------------------------------------
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"\nGPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) ex
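
One hedged observation, in case it is relevant: CUFFT calls return a cufftResult rather than a cudaError_t, so they usually need their own checking macro instead of the gpuErrchk above. A minimal sketch:

#include <cstdio>
#include <cstdlib>
#include <cufft.h>

#define cufftErrchk(ans) { cufftAssert((ans), __FILE__, __LINE__); }
inline void cufftAssert(cufftResult code, const char *file, int line)
{
    if (code != CUFFT_SUCCESS)
    {
        fprintf(stderr, "\nCUFFTassert: error %d %s %d\n", (int)code, file, line);
        exit(EXIT_FAILURE);
    }
}

// usage (illustrative): cufftErrchk(cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD));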

CUDA Nsight report: no kernel launches captured

I wrote a simple CUDA program in a .cu file. When I wanted to see the performance of this program, I chose "Nsight->Start Performance Analysis...." and then chose "Profile CUDA Application". After launching the application for a while and finishing the capture, the report says "No kernel launches captured" and the summary report says "1 error encountered". Can someone help me figure out why this happened?

CUDA initialization error after fork

I get "initialization error" after calling fork(). If I run the same program without the fork, all works fine. if (fork() == 0) { ... cudaMalloc(....); ... } What would cause this? A complete example is below. If I comment out the cudaGetDeviceCount call, it works fine. #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <sys/types.h> #include <sys/wait.h> #include <cuda_runtime.h> #define PERR(call) \ if (call) {\ fpri

CUDA: is sort_by_key in Thrust a blocking call?

I repeatedly enqueue a sequence of kernels:

for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array
        Kernel 2 - sort array
        Kernel 3 - operate on array
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end

Kernel 3 is of order O(N^2), so it is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:

thrust::d
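
For what it's worth, a hedged sketch of issuing Thrust work on a specific CUDA stream via an execution policy (available in newer Thrust releases), which is one way to control how the sort interleaves with the other enqueued kernels; the names are illustrative, and Thrust may still synchronize internally for temporary allocations:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

void sort_on_stream(thrust::device_vector<int>& keys,
                    thrust::device_vector<float>& vals,
                    cudaStream_t stream)
{
    // submit the sort to `stream` instead of the default stream
    thrust::sort_by_key(thrust::cuda::par.on(stream),
                        keys.begin(), keys.end(), vals.begin());
}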

cudaEventSynchronize from multiple threads

If I call cudaEventSynchronize for different events in different threads, will each thread wait for the corresponding event independently? I mean, when one event finishes, the corresponding thread will be allowed to proceed while other threads may still be waiting.

Compress "sparse data" with CUDA (CCL: connected component labeling reduction)

I have a list of 5 million 32-bit integers (actually a 2048 x 2560 image) that is 90% zeros. The non-zero cells are labels (e.g. 2049, 8195, 1334300, 34320923, 4320932) that are not sequential or consecutive in any way (it is the output of our custom connected component labeling (CCL) algorithm). I am working with an NVIDIA Tesla K40, so if this needs any prefix-scan work, I would love it to use SHUFFLE, BALLOT or any of the higher compute-capability features. I don't need a fully worked-out example, j
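
As a baseline for comparison, a minimal stream-compaction sketch with thrust::copy_if (which is prefix-scan based under the hood); it simply gathers the non-zero labels, d_img is a made-up name for the device-resident image, and nothing here is tuned for the K40:

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct is_nonzero
{
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

thrust::device_vector<int> compact_labels(const thrust::device_vector<int>& d_img)
{
    thrust::device_vector<int> d_labels(d_img.size());
    auto end = thrust::copy_if(d_img.begin(), d_img.end(), d_labels.begin(), is_nonzero());
    d_labels.resize(end - d_labels.begin());   // keep only the non-zero labels
    return d_labels;
}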

What's wrong with this simple CUDA program?

I'm just trying to use CUDA to blank an image, but "before" and "after" I get the same original image. I can't figure out the problem.

sumKernel.cu:

#include "sumKernel.h"

__global__ void _sumKernel(char *image, int width, int height, char *kernel, int kerwidth, int kerheight) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    image[idx] = 0;
}

void sumKernel(char *image, int width, int height, char *kernel, int kerwidth, int kerheight) {
    dim3 blocks(1);
    dim3 threads(width*heig

Timing CUDA Streams

I'm having some trouble understanding this code (running on a non-Hyper-Q-compatible GPU):

CHECK(cudaEventRecord(start, 0));

// dispatch job with depth first ordering
for (int i = 0; i < n_streams; i++) {
    kernel_1<<<grid, block, 0, streams[i]>>>();
    kernel_2<<<grid, block, 0, streams[i]>>>();
    kernel_3<<<grid, block, 0, streams[i]>>>();
    kernel_4<<<grid, block, 0, streams[i]>>>();
}

// record stop event
CHEC

CUDA linking error: undefined reference to `cyl_bessel_i0f'

Why am I getting the following linker error with the example program below?

test_cyl_bessel_i0f.o: In function `main':
tmpxft_00007f3f_00000000-4_test_cyl_bessel_i0f.cudafe1.cpp:(.text+0x26): undefined reference to `cyl_bessel_i0f'
collect2: error: ld returned 1 exit status

I'm using the following commands to compile and link the code:

nvcc -I/usr/local/cuda/include -c test_cyl_bessel_i0f.cu
nvcc -L/usr/local/cuda/lib64 -o test_cyl_bessel_i0f test_cyl_bessel_i0f.o -lcudart

The example prog
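
A hedged observation: as far as I know, cyl_bessel_i0f is provided only as a device-side math function, so it links when the call is made from a kernel rather than from host code. A minimal sketch of that arrangement (kernel and buffer names are made up):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bessel_kernel(float *out, float x)
{
    *out = cyl_bessel_i0f(x);   // device math function: modified Bessel function I0
}

int main()
{
    float *d_out, h_out;
    cudaMalloc(&d_out, sizeof(float));
    bessel_kernel<<<1, 1>>>(d_out, 1.5f);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("I0(1.5) = %f\n", h_out);
    return 0;
}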

CUDA: NVML code doesn't compile

I am implementing an example program with the NVML library as shown at https://devtalk.nvidia.com/default/topic/504951/how-to-call-nvml-apis-/ The program is as follows:

#include <stdio.h>
#include <nvidia/gdk/nvml.h>

const char * convertToComputeModeString(nvmlComputeMode_t mode)
{
    switch (mode)
    {
        case NVML_COMPUTEMODE_DEFAULT:
            return "Default";
        case NVML_COMPUTEMODE_EXCLUSIVE_THREAD:
            return "Exclusive_Thread";
        case
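
In case the failure is at the link stage rather than compilation, a hedged note: NVML lives in libnvidia-ml, so the build line typically needs -lnvidia-ml (the exact include and library paths depend on where the GDK installed them; the file name below is made up):

gcc nvml_example.c -o nvml_example -lnvidia-ml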

Constructing a binary tree recursively in CUDA

I want to build a binary tree in a vector such that the parent's value is the sum of its two children. To build the tree recursively in C, it would look like:

int construct(int elements[], int start, int end, int* tree, int index) {
    if (start == end) {
        tree[index] = elements[start];
        return tree[index];
    }
    int middle = start + (end - start) / 2;
    tree[index] = construct(elements, start, middle, tree, index*2) +
                  construct(elements, middle, end, tree, inde
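
One hedged way to express this without recursion on the GPU is to build the tree bottom-up, one level per kernel launch, assuming the usual heap layout in which node k has children 2k and 2k+1 (which appears to match the index*2 convention in the snippet); this is a sketch of a single level, not a complete solution:

// one level of a bottom-up build: parents occupy [levelStart, levelStart + levelSize)
__global__ void build_level(int *tree, int levelStart, int levelSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < levelSize) {
        int node = levelStart + i;
        tree[node] = tree[2 * node] + tree[2 * node + 1];
    }
}

// host side (illustrative): launch build_level once per level, from the deepest level up to the root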

AES decryption using CUDA

For a project that I'm working on, I'm supposed to brute-force decrypt an AES-encrypted ciphertext given a portion of the key. The remaining keyspace for the ciphertext is 2^40. I'd like to run the decryption using CUDA (dividing the keyspace over the GPU cores), but I can't seem to find a suitable CUDA AES library. I was wondering if there might be ways around this, such as running a C AES library's decrypt routine in a kernel. Looking at this question suggests that this may not be possible. Another opt

CUDA kernel does not launch, CMake Visual Studio 2015 project

I have a relatively simple CUDA kernel, and I immediately call the kernel in the main method of my program in the following way:

__global__ void block() {
    for (int i = 0; i < 20; i++) {
        printf("a");
    }
}

int main(int argc, char** argv) {
    block<<<1, 1>>>();
    cudaError_t cudaerr = cudaDeviceSynchronize();
    printf("Kernel executed!\n");
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n", cuda

CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error

When I try to call a CUDA kernel (a __global__ function) using a function pointer, everything appears to work just fine. However, if I forget to provide the launch configuration when calling the kernel, NVCC produces no error or warning; the program compiles and then crashes if I attempt to run it.

__global__ void bar(float x) { printf("foo: %f\n", x); }

typedef void(*FuncPtr)(float);

void invoker(FuncPtr func) {
    func<<<1, 1>>>(1.0);
}

invoker(bar);
cudaDevi

CUDA: is NVRTC unavailable for Win32?

I'm running Python 2.7 x32 and getting this error: Could not load "nvrtc64_75.dll": %1 is not a valid Win32 application. I've also tried with CUDA 8. As I realized, the NVRTC docs list x64 as a requirement: "NVRTC requires the following system configuration: Operating System: Linux x86_64, Linux ppc64le, Linux aarch64, Windows x86_64, or Mac OS X." (nvrtc64_75.dll really does have 0x8664 in IMAGE_FILE_HEADER and 0x20b (PE32+) magic.) I'm trying to use libgpuarray's pygpu with Theano and I've

CUDA: converting Thrust device iterators to raw pointers

I'm considering the following simple code in which I'm converting thrust::host_vector<int>::iterator h_temp_iterator = h_temp.begin(); and thrust::device_vector<int>::iterator d_temp_iterator = d_temp.begin(); to raw pointers. To this end, I'm passing &(h_temp_iterator[0]) and &(d_temp_iterator[0]) to a function and a kernel, respectively. The former (CPU case) compiles; the latter (GPU case) does not. The two cases should in principle be symmetric, so I do not understand the rea
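
For reference, a minimal sketch of the conversion I usually see for device vectors, going through thrust::raw_pointer_cast rather than taking the address of a dereferenced device iterator (the kernel name and launch configuration are illustrative):

#include <thrust/device_vector.h>

__global__ void my_kernel(int *data, int n);   // hypothetical kernel

void launch(thrust::device_vector<int>& d_temp)
{
    int *raw = thrust::raw_pointer_cast(d_temp.data());
    my_kernel<<<(d_temp.size() + 255) / 256, 256>>>(raw, (int)d_temp.size());
}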

How to pass an array of vectors to a CUDA kernel?

I now have thrust::device_vector<int> A[N]; and my kernel function:

__global__ void kernel(...) {
    auto a = A[threadIdx.x];
}

I know that via thrust::raw_pointer_cast I could pass a device_vector to a kernel, but how can I pass an array of vectors to it?
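
A hedged sketch of one common workaround: collect the raw data pointers of the N vectors on the host, copy that pointer table to the device, and pass an int** to the kernel (all names are illustrative):

#include <thrust/device_vector.h>

__global__ void kernel(int **ptrs)
{
    int *a = ptrs[threadIdx.x];   // this thread's vector data
    // ... use a[...] ...
}

void launch(thrust::device_vector<int> *A, int N)
{
    // gather the raw device pointers on the host
    int **h_ptrs = new int*[N];
    for (int i = 0; i < N; ++i)
        h_ptrs[i] = thrust::raw_pointer_cast(A[i].data());

    // copy the pointer table itself to the device
    int **d_ptrs;
    cudaMalloc(&d_ptrs, N * sizeof(int *));
    cudaMemcpy(d_ptrs, h_ptrs, N * sizeof(int *), cudaMemcpyHostToDevice);

    kernel<<<1, N>>>(d_ptrs);

    cudaFree(d_ptrs);
    delete[] h_ptrs;
}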

Unable to call CUDA half precision functions from the host

I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half-precision converter and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both HOST and DEVICE code. I wrote a simple test program:

#include <iostream>
#include <cuda_fp16.h>

int main() {
    const float a =

CUDA: the CUSPARSE tridiagonal solver `cusparseDgtsv` is slow

I'm solving a system of linear equations with the dedicated solver cusparseDgtsv() from the CUSPARSE library and find that it produces no acceleration. I ran the tests on:

Device 0: "Tesla K40s"
CUDA Driver Version / Runtime Version: 9.2 / 9.1
CUDA Capability Major/Minor version number: 3.5
Intel Xeon E5-2697 v3 2.60GHz

I compile the following test code with nvcc -lcusparse main.cu -o dgtsv.app -gencode arch=compute_35,code=sm_35 for the Tesla K40s.

#include <stdio.h>
#include <stdlib.h>

CUDA - dynamically reallocate more global memory in Kernel

I have a question about the following task: "Given a two-dimensional array a[N][M], i.e., N rows of length M. Each element of the array contains a random integer value between 0 and 16. Write a kernel compact(int *a, int *listM, int *listN) that consists of only one block of N threads, where each thread counts, for one row of the array, how many elements have the value 16. The threads write these numbers into an array num of length N in shared memory, and then (after a barrier) one of the th
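
A minimal sketch of just the counting half of the task as described (one block of N threads, each thread scanning its own row and writing its count into shared memory); the row length M is passed explicitly here for the sketch, and the part of the assignment after the barrier is cut off above:

__global__ void compact(int *a, int *listM, int *listN, int M)
{
    extern __shared__ int num[];   // length N = blockDim.x, sized at launch

    int row = threadIdx.x;
    int count = 0;
    for (int j = 0; j < M; ++j)
        if (a[row * M + j] == 16)
            ++count;

    num[row] = count;
    __syncthreads();               // the truncated part of the task continues here
}

// launch (illustrative): compact<<<1, N, N * sizeof(int)>>>(d_a, d_listM, d_listN, M);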

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU. The docs make it quite clear that global memory is much s
