Version (0.21.0)

NumbaPro will be deprecated, with most code generation features moved into the open-source Numba and the CUDA bindings moved into a new commercial package called “Accelerate”. The new package will feature more high-level API functions from the CUDA libraries as well as MKL.

The next release of NumbaPro will provide aliases to the features that are moved to Numba and Accelerate. No new features will be added to NumbaPro. In the future, there may be bug-fix releases to maintain the aliases to the moved features.


  • Depends on numba 0.21.0
  • Fix auto thread-per-block tuning support for CUDA CC 3.7 devices
  • Blas.dotu is deprecated. A warning is generated when it is used. is an alias to it and is preferred.

Version (0.20.0)

This release depends on numba 0.20, which has upgraded to CUDA 7 for GPU support. CUDA 7 deprecates support for all 32-bit platforms. The oldest supported Windows version is Windows 7. This does not affect CPU features.

Version (0.19.0)

  • Depends on numba 0.19
  • Fixes issue with GPU ufunc broadcasting
  • Improves GPU ufunc implementation

Version (0.18.0)

  • Depends on numba 0.18.1
  • Improve CUDA gufunc implementation
    • Simplified code generation
    • Smarter blocksize selection

Version (0.17.1)

  • Depends on numba 0.17.0
  • Warns about incompatible numba version at import time
  • Fixes some CUDA library APIs on Windows
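
The import-time version warning noted above can be pictured with a minimal sketch. This is not NumbaPro's actual code: the helper names `parse_version` and `check_dependency` are hypothetical, and treating "same major.minor" as compatible is an assumption made only for illustration.

```python
import warnings


def parse_version(text):
    """Turn '0.17.0' into (0, 17, 0); non-numeric pieces are skipped."""
    parts = []
    for piece in text.split("."):
        if piece.isdigit():
            parts.append(int(piece))
    return tuple(parts)


def check_dependency(found, required):
    """Warn, rather than fail, when `found` differs from `required` in major.minor.

    Hypothetical policy: only the first two version components must match.
    Returns True when the versions are considered compatible.
    """
    compatible = parse_version(found)[:2] == parse_version(required)[:2]
    if not compatible:
        warnings.warn(
            "this package expects numba %s but found numba %s" % (required, found),
            UserWarning,
        )
    return compatible
```

Calling `check_dependency("0.16.0", "0.17.0")` at import time would emit a `UserWarning` and return `False`, while still letting the import proceed.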

Version (0.17.0)

  • Depends on numba 0.16.0
  • Replaces llvmpy with llvmlite, which also upgrades to LLVM 3.5
  • Update occupancy autotuner for CC 5.0 and CC 5.2 devices
  • Fix handling of empty array in GPU reduction
  • Fix occupancy autotuner that may pick invalid blocksize
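
As a rough illustration of what an occupancy-based blocksize choice involves, the sketch below picks the smallest warp-multiple block size that maximizes resident threads per multiprocessor. It is a simplified model, not the autotuner shipped in NumbaPro: it ignores register and shared-memory pressure, the function name `pick_blocksize` is hypothetical, and the device limits passed in are illustrative inputs rather than values taken from these notes.

```python
def pick_blocksize(max_threads_per_block, max_threads_per_sm,
                   max_blocks_per_sm, warp_size=32):
    """Return the smallest warp-multiple block size with the most resident threads.

    Simplified model: only thread and block limits per multiprocessor are
    considered; register and shared-memory usage are ignored.
    """
    best_size, best_resident = None, -1
    # Only sizes the device can launch are candidates, so an invalid
    # block size can never be returned.
    for size in range(warp_size, max_threads_per_block + 1, warp_size):
        blocks = min(max_blocks_per_sm, max_threads_per_sm // size)
        resident = blocks * size
        if resident > best_resident:
            best_size, best_resident = size, resident
    if best_size is None:
        raise ValueError("no valid block size for these limits")
    return best_size
```

With illustrative limits of 1024 threads per block, 2048 threads per multiprocessor and 32 resident blocks, `pick_blocksize(1024, 2048, 32)` returns 64; lowering the resident-block limit to 16 moves the answer to 128.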

Version (0.16.0)

  • Add numbapro.cuda.reduce for autogeneration of CUDA reduce kernels and driver.
  • Fix device-to-host auto transfer logic in some ufuncs.
  • Upgrades to Numba 0.15

Version (0.15.0)

Version (0.14.3)

  • CUDA driver is initialized lazily
  • Improved stability of CUDA ufunc machinery
  • Improved stability of parallel ufunc

Version (0.14.2)

  • Unify numba.cuda and numbapro.cuda backend
  • Enable Python 3 support
  • Fixes workqueue module import for the embedded Python use case

Version (0.14.1)


  • Fix UnboundReferenceError caused by mishandling of an incompatible driver (pre-CUDA 5.5 driver). The fix relaxes the driver requirement by allowing some features to fail on use.
  • numbapro.cuda.* symbols are still exported when CUDA is not available. They raise an exception on use.

Version (0.14.0)


  • Add cuSparse API
  • Improve CUDA driver and resource management
  • Some CUDA-Python language features are now open-sourced as numba.cuda


  • New CUDA driver system prevents freezing OSX on kernel launch error

Version (0.13.2)

  • Fix problem with NumPy 1.8 array scalar contiguousness
  • Fix CUDA target auto-initialization when importing numbapro
  • Fix an access violation error on Windows 8 due to mishandling by LLVM.
  • Add non-public API for profiler control.

Version (0.13.1)

  • Guard error due to mishandling of interleaved memory buffer (#60)
  • Update to use Numba 0.12.1
  • Fix powi bug

Version (0.13)

  • Add print statement for strings and scalar numeric types for debugging on GPU
  • Add constant and local memory array allocation on GPU
  • Add debug mode for GPU
  • Allow raising exception classes on GPU
  • Update CUDA toolkit libraries
  • Fix boolean mapping

Version (0.12.7)

  • Fix major bug that mistreats py2 division as inplace floor-division for real numbers.
  • Fix use of an array as an argument to a CUDA device function.
  • Delay initialization of the CUDA subsystem until the first import of the cuda package.
  • Add docstrings.

Version (0.12.6)

  • Fix major bug that mistreats py2 division as floor-division for real numbers.
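
To make the distinction behind this fix concrete, the short example below contrasts true division and floor division on real numbers using Python's standard `operator` module. It only illustrates the two semantics that were being confused; it says nothing about how the compiler bug was actually fixed.

```python
import operator

# True division keeps the fractional part of the quotient.
true_result = operator.truediv(7.0, 2.0)

# Floor division rounds the quotient toward negative infinity.
floor_result = operator.floordiv(7.0, 2.0)
negative_floor = operator.floordiv(-7.0, 2.0)

print(true_result, floor_result, negative_floor)  # prints: 3.5 3.0 -4.0
```

For real-number operands, `/` should behave like `truediv` here, so a compiler that emits `floordiv` silently drops the fractional part.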

Version (0.12.5)

  • Update to Numba 0.10.2
  • Update to LLVM 3.3
  • Various bug fixes

Version (0.12.4)

  • Update to Numba 0.10.0
  • Minor bug fixes

Version (0.12.3)

  • Accept older drivers by deferring driver errors to the first use of a specific API
  • Report incompatible GPU at context creation
  • Improve device information reporting
  • Autotuning based on compiler info and occupancy calculator
  • Add basic support for ravel and reshape
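
The idea of deferring a driver error to first use can be sketched in plain Python. This is only an illustration of the pattern, not NumbaPro's driver code: `DeferredError`, `bind`, and the two loader functions are hypothetical names introduced for this example.

```python
class DeferredError(object):
    """Placeholder for an API that failed to load; raises only when called."""

    def __init__(self, name, error):
        self._name = name
        self._error = error

    def __call__(self, *args, **kwargs):
        raise RuntimeError("%s is unavailable: %s" % (self._name, self._error))


def bind(name, loader):
    """Return loader() if it succeeds, else a placeholder that raises on use."""
    try:
        return loader()
    except Exception as exc:
        # Remember the failure instead of aborting the whole import.
        return DeferredError(name, exc)


def load_new_api():
    # Hypothetical: an older driver lacks this entry point.
    raise OSError("symbol not found in driver")


def load_old_api():
    # Hypothetical: this entry point exists in every driver version.
    return lambda: "ok"


new_api = bind("new_api", load_new_api)  # no exception is raised here
old_api = bind("old_api", load_old_api)
```

With this arrangement `old_api()` works on an older driver, and only a call to `new_api()` raises `RuntimeError`.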

Version (0.12.2)

  • Distribute CUDA toolkit in Anaconda
  • Better error message
  • Fix gufunc signature parsing to accept trailing comma.
  • Fix CUDA driver log info bug
  • Support JIT linking

Version (0.12.1)

  • Fix libNVVM search path (now accept directory path)
  • Fix sign-extension error in for-loop precondition
  • Fix support for true-division

Version (0.12.0)

  • Use CUDA 5.5rc
  • Expand math support through CUDA NVVM libdevice
  • Rewritten nopython mode for CUDA-Python
  • Removed experimental CU API
  • Removed minivectorize

Version (0.11.0)

  • Add cuBlas binding
  • Improve CUDA ndarray and memory management
  • Add CUDA mapped host memory
  • Add CUDA event

Version (0.10.1)

  • Fix CU memory leak
  • Fix CU hanging on some GPUs
  • Improve error message for unsupported GPU devices
  • Add cuFFT

Version (0.10)

  • Added Compute Unit (CU) API
  • Added cuRAND binding
  • Added CUDA device array
  • Various improvements to CUDA support

Version (0.9)

  • Improve CUDA driver discovery.

Version 0.8

  • Update for SSA type inference in Numba
  • Allow user to select CUDA device
  • Add support for pinned and mapped CUDA memory
  • Improve small memory allocation in CUDA
  • Default to use libNVVM from Anaconda
  • Bug fixes

Version 0.7

  • Prange: parallel for-range
  • Array slicing
  • Refactor CUDA dispatch mechanisms
  • Migrate to NVVM instead of PTX for CUDA codegen

Version 0.6 and earlier

  • Array expressions
  • Fast ufuncs and generalized-ufunc (gufunc) with single-core, multi-core and CUDA