Writing CUDA-Python¶
The CUDA JIT is a low-level entry point to the CUDA features in NumbaPro. It translates Python functions into PTX code that executes on CUDA hardware. The jit decorator is applied to Python functions written in our Python dialect for CUDA. NumbaPro interacts with the CUDA Driver API to load the PTX onto the CUDA device and execute it.
Imports¶
Most of the public API for CUDA features is exposed through the
numbapro.cuda
module:
from numbapro import cuda
Compiling¶
CUDA kernels and device functions are compiled by decorating a Python function with the jit or autojit decorators.
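For example (a minimal sketch; the string type signatures and the device=True flag follow the usual NumbaPro conventions, and the function names are illustrative):
from numbapro import cuda

# A kernel compiled ahead-of-time with an explicit type signature.
@cuda.jit('void(float32[:])')
def zero_first(ary):
    ary[0] = 0.0

# A device function: callable only from kernels and other device functions.
@cuda.jit('float32(float32)', device=True)
def add_one(x):
    return x + 1.0

# autojit defers compilation until the argument types are first seen.
@cuda.autojit
def zero_first_auto(ary):
    ary[0] = 0.0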
Thread Identity by CUDA Intrinsics¶
A set of CUDA intrinsics is used to identify the current execution thread. These intrinsics are meaningful only inside a CUDA kernel or device function. A common pattern is to assign the computation of each element in the output array to a thread.
For a 1D grid:
tx = cuda.threadIdx.x   # thread index within the block
bx = cuda.blockIdx.x    # block index within the grid
bw = cuda.blockDim.x    # number of threads per block
i = tx + bx * bw        # global thread index
array[i] = something(i)
For a 2D grid:
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bx = cuda.blockIdx.x
by = cuda.blockIdx.y
bw = cuda.blockDim.x
bh = cuda.blockDim.y
x = tx + bx * bw        # global x index
y = ty + by * bh        # global y index
array[x, y] = something(x, y)
Since these patterns are so common, there is a shorthand function to produce the same result.
For a 1D grid:
i = cuda.grid(1)
array[i] = something(i)
For a 2D grid:
x, y = cuda.grid(2)
array[x, y] = something(x, y)
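Putting these pieces together, a complete kernel and launch might look like the following (an illustrative sketch; the bounds check guards threads in the last block that fall past the end of the array):
import numpy as np
from numbapro import cuda

@cuda.jit('void(float32[:])')
def increment(ary):
    i = cuda.grid(1)
    if i < ary.shape[0]:   # guard threads past the end of the array
        ary[i] += 1.0

an_array = np.zeros(256, dtype=np.float32)
blockdim = 32
griddim = (an_array.shape[0] + blockdim - 1) // blockdim
increment[griddim, blockdim](an_array)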
Memory Transfer¶
By default, any NumPy array used as an argument to a CUDA kernel is transferred automatically to and from the device. However, to achieve maximum performance and minimize redundant memory transfers, the user should manage the transfers explicitly.
Host->device transfers are asynchronous with respect to the host. Device->host transfers are synchronous with respect to the host. If a non-zero CUDA stream is provided, the transfer becomes asynchronous.
Special DeviceNDArray factories are provided for allocating device arrays and managing transfers explicitly.
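For example (a minimal sketch; cuda.to_device and copy_to_host appear elsewhere in this document, while cuda.device_array_like is an assumed factory name):
import numpy as np
from numbapro import cuda

an_array = np.arange(1024, dtype=np.float32)

devary = cuda.to_device(an_array)        # allocate device memory and copy host->device
devout = cuda.device_array_like(devary)  # allocate device memory only; no transfer
# ... launch kernels on devary and devout ...
devout.copy_to_host(an_array)            # copy device->host into an existing array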
Memory Lifetime¶
The lifetime of a device array is bound to the lifetime of the DeviceNDArray instance.
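For example (an illustrative sketch; the exact moment of deallocation follows Python garbage collection):
devary = cuda.to_device(an_array)   # device memory is allocated here
del devary                          # device memory is released once the
                                    # DeviceNDArray instance is garbage collected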
CUDA Stream¶
A CUDA stream is a command queue for the CUDA device. By specifying a stream, CUDA API calls become asynchronous, meaning that a call may return before the command has completed. Memory transfer instructions and kernel invocations can take a CUDA stream:
stream = cuda.stream()
devary = cuda.to_device(an_array, stream=stream)
a_cuda_kernel[griddim, blockdim, stream](devary)
devary.copy_to_host(an_array, stream=stream)
# data may not be available in an_array
stream.synchronize()
# data available in an_array
An alternative syntax is available using a Python context manager:
stream = cuda.stream()
with stream.auto_synchronize():
    devary = cuda.to_device(an_array, stream=stream)
    a_cuda_kernel[griddim, blockdim, stream](devary)
    devary.copy_to_host(an_array, stream=stream)
# data available in an_array
When the Python with context exits, the stream is automatically synchronized.
Synchronization Primitives¶
We currently support cuda.syncthreads() only. It is the same as __syncthreads() in CUDA-C.
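For example, cuda.syncthreads() can act as a barrier between a read phase and a write phase (a minimal sketch; it assumes a single-block launch, since syncthreads only synchronizes threads within one block):
from numbapro import cuda

@cuda.jit('void(float32[:])')
def rotate(ary):
    i = cuda.threadIdx.x
    n = cuda.blockDim.x
    val = ary[i]                # every thread reads its own element
    cuda.syncthreads()          # wait until all threads have finished reading
    ary[(i + 1) % n] = val      # now safe to overwrite another thread's slot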