CUDA Python Specification (v0.2)

(This documents reflects the implementation of CUDA Python in NumbaPro 0.12. In time, we may refine the specification.)

As usage of Python on CUDA GPUs is becoming more mature, it has become necessary to define a formal specification for a dialect and its mapping to the PTX ISA. There are places where the semantic of this dialect differs from the Python semantic. The change in semantic is necessary for us to generate high-performance code that is otherwise hard to achieve.

No-Python Mode (NPM)

CUDA Python is a superset of the No-Python mode (NPM). NPM is a statically typed subset of the Python language. It does not use the Python runtime; thus, it only supports lower level types; such as booleans, ints, floats, complex numbers and arrays. It does not support Python objects. Since we drop the support for objects entirely, many basic language constructs must be handled differently. For instance, a simple for-loop is:

for i in range(10):

where range returns an iteratable. NPM restricts the language so that only range or xrange can be used.

For array support, NPM models the NumPy ndarray. An array is a structure with a pointer to the data, an array of shape and an array of strides. Valid attributes are shape, strides, size and ndim. Arrays cannot be unpacked. The only way to access array elements is through the __getitem__ and __setitem__ operators (e.g. ary[i, j]). Slicing is not supported. When indexing into an array, a N-dimension array must be provided with N indices.

Tuples are minimally supported for unpacking array shape and strides attributes and some return value of calls.

In time, we aim to enhance NPM to expand the supported subset and recognize more idiomatic Python patterns.


  • no object;
  • no exception;
  • for-loop only works on range or xrange;
  • supported types: ints, floats, complex numbers, and arrays.

Type Inference

The type inference algorithm for CUDA Python differs from Numba as we recognize that its users require stronger typing to better predict code performance. This is a summary of the type inference rules:

  • Implicit coercion for all ints and floats only.
  • Variable type is assigned at definition but a variable can be redefined; thus its type can be modified at the next assignment.
  • Inside a loop, the variable type remains unchanged even at redefinition. The type assigned at the preloop block (the dominator of the basic-block) is assumed. This greatly differs from Python semantic.

User can force the type of any value by using the type object defined in numbapro namespace:

from numbapro import cuda, int16, float32

def a_cuda_kernel(arg):
    must_be_int16 = int16(123)
    must_be_float32 = float32(321)

Basic Arithmetic Operations

For binary operators + - *, the operands are coerced to the most generic type of the two before the computation. The result type is the coerced type.

For floordiv //, the coercion rule on operands for basic binary operators
applies. But, the result type is always coerced to an integer of the same bitwidth and at least has 32-bits.

For truediv /, the operands are promoted to a floating point representation with bitwidth equals to the maximum of the two operands before the computation and at least has 32-bits. The result type is the coerced type.

For binary bitwise & | ^ >> <<, the operands must be of integer types and they are coerced to the most generic type of the two before the computation. The result type is the coerced type.

For complex numbers, only + - * are defined.

Please refer to the CUDA-C Programming Guide: Arithmetic Instructions for the precision each operation.


intp is used to represent the integer whose width equals the address width.

Array Operations

Array attributes are read-only:

  • shape contains the number of elements for each dimension. It can be indexed or unpacked like a tuple. It is a tuple of intp.
  • strides contains the number of bytes skip to move forward to the next element for a given dimension. It can be indexed or unpacked like a tuple. It is a tuple of intp.
  • size contains the number of elements in the array but may not be correspond to the actual size of the data buffer since strides can be zero or negative. It is an intp.
  • ndim contains the number of dimension in the array. It is an intp.

__getitem__ returns the element at the given index. Slicing or fancy indexing are not supported. The result type is always the same as the element type of the array.

__setitem__ stores a value into the array at an index. The value is coerced if necessary.

CUDA Intrinsics

All intrinsics are defined under the numbapro.cuda namespace.

Thread ID intrinsics:

  • cuda.threadIdx.x, cuda.threadIdx.y, cuda.threadIdx.z are the X, Y and Z IDs of the thread.
  • cuda.blockIdx.x, cuda.blockIdx.y are the X and Y ID of the block.
  • cuda.blockDim.x, cuda.blockDim.y, cuda.blockDim.z are the X, Y and Z width of the thread block.
  • cuda.gridDim.x, cuda.gridDim.y are the X and Y width of the grid.

Barrier intrinsics:

  • cuda.syncthreads() equivalent to __syncthreads() in CUDA-C. It is a thread block level barrier.

Shared memory intrinsics

  • cuda.shared.array(shape, dtype) constructs a statically allocated array in the shared memory of kernel. dtype argument must be a type object defined in the NumbaPro namespace. It must be declared in the entry block of the kernel.


Python Dialect for CUDA translates math functions defined in the math module of the Python standard library. All the functions use the semantic of the CUDA-C definition. Please refer to the CUDA-C Programming Guide: Math Function.

Supported functions:

math.expm1              # not available in Python 2.6

Fast Math

coming soon in the next release…