cuBLAS

Provides basic linear algebra building blocks. See NVIDIA cuBLAS.

The cuBLAS binding provides an interface that accepts NumPy arrays and Numba’s CUDA device arrays. The binding automatically transfers NumPy array arguments to the device as required. Because these automatic transfers can be redundant, the best performance is usually obtained by manually copying NumPy arrays into device arrays and having cuBLAS operate on device arrays wherever possible.

No special naming convention is used to identify the data type, unlike in the BLAS C and Fortran APIs. Arguments for array storage information which are part of the cuBLAS C API are also not necessary since NumPy arrays and device arrays contain this information.

All functions are accessed through the accelerate.cuda.blas.Blas class:

class accelerate.cuda.blas.Blas(*args, **kws)

All BLAS subprograms are available under the Blas object.

Parameters:stream – Optional. A CUDA Stream.
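The manual-transfer advice above can be sketched as follows. This is a minimal sketch, assuming Numba's `cuda.to_device` for the explicit copy and the `nrm2` and `dot` methods documented below. The helper name `norm_and_dot` and the NumPy fallback (used when the binding or a CUDA device is unavailable) are illustrative additions, not part of the library.

```python
import numpy as np

try:
    from numba import cuda
    from accelerate.cuda.blas import Blas
    HAVE_GPU_BLAS = cuda.is_available()
except Exception:
    # Binding or CUDA driver not installed: fall back to NumPy below.
    HAVE_GPU_BLAS = False


def norm_and_dot(x, y):
    """Return (||x||_2, x . y), copying each input to the device only once."""
    if HAVE_GPU_BLAS:
        blas = Blas()
        d_x = cuda.to_device(x)  # explicit host-to-device copy
        d_y = cuda.to_device(y)
        # Both calls reuse the same device arrays, so no further transfers
        # of x and y are needed.
        return blas.nrm2(d_x), blas.dot(d_x, d_y)
    # Hypothetical fallback so the sketch also runs without a GPU.
    return np.linalg.norm(x), np.dot(x, y)


norm, dot = norm_and_dot(np.array([3.0, 4.0]), np.array([1.0, 2.0]))
```

Passing the NumPy arrays directly to `nrm2` and `dot` would also work, but each call would then transfer its arguments again.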

BLAS Level 1

accelerate.cuda.blas.Blas.nrm2(x)

Computes the L2 norm for array x. Same as numpy.linalg.norm(x).

Parameters:x (python.array) – input vector
Returns:resulting norm.
accelerate.cuda.blas.Blas.dot(x, y)

Compute the dot product of array x and array y. Same as np.dot(x, y).

Parameters:
  • x (python.array) – vector
  • y (python.array) – vector
Returns:

dot product of x and y

accelerate.cuda.blas.Blas.dotc(x, y)

Computes the dot product of array x and array y using the conjugate of the elements of x. For complex dtypes only. Same as np.vdot(x, y).

Parameters:
  • x (python.array) – vector
  • y (python.array) – vector
Returns:

dot product of x and y

accelerate.cuda.blas.Blas.scal(alpha, x)

Scale x in place by alpha. Same as x = alpha * x.

Parameters:
  • alpha – scalar
  • x (python.array) – vector
accelerate.cuda.blas.Blas.axpy(alpha, x, y)

Compute y = alpha * x + y in place.

Parameters:
  • alpha – scalar
  • x (python.array) – vector
  • y (python.array) – vector; overwritten with the result
accelerate.cuda.blas.Blas.amax(x)

Find the index of the first element of largest magnitude in array x. Same as np.argmax(x) when all elements are real and non-negative.

Parameters:x (python.array) – vector
Returns:index (start from 0).
accelerate.cuda.blas.Blas.amin(x)

Find the index of the first element of smallest magnitude in array x. Same as np.argmin(x) when all elements are real and non-negative.

Parameters:x (python.array) – vector
Returns:index (start from 0).
accelerate.cuda.blas.Blas.asum(x)

Compute the sum of the absolute values of all elements in array x.

Parameters:x (python.array) – vector
Returns:np.abs(x).sum() for real x
accelerate.cuda.blas.Blas.rot(x, y, c, s)

Apply the Givens rotation matrix specified by the cosine element c and the sine element s in place to the elements of vectors x and y.

Same as x, y = c * x + s * y, -s * x + c * y

Parameters:
  • x (python.array) – vector
  • y (python.array) – vector
  • c – cosine element of the rotation
  • s – sine element of the rotation
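The rotation formula above can be checked against a plain NumPy reference. This is a sketch of the arithmetic only: `rot_reference` is a hypothetical helper that returns new arrays, whereas the binding updates x and y in place.

```python
import numpy as np


def rot_reference(x, y, c, s):
    """Return (c*x + s*y, -s*x + c*y) without modifying the inputs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Both outputs are computed from the ORIGINAL x and y.
    return c * x + s * y, -s * x + c * y


# A quarter turn (c = 0, s = 1) maps x to y and y to -x.
x_new, y_new = rot_reference([1.0, 0.0], [0.0, 1.0], c=0.0, s=1.0)
```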
accelerate.cuda.blas.Blas.rotg(a, b)

Constructs the Givens rotation matrix with the column vector (a, b).

Parameters:
  • a – first element of the column vector
  • b – second element of the column vector
Returns:

a tuple (r, z, c, s)

r – r = sqrt(a**2 + b**2)

z – Used to reconstruct c and s.

Refer to the cuBLAS documentation for details.

c – The cosine element.

s – The sine element.
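As a rough illustration of what the returned r, c and s mean, the sketch below builds a rotation that zeros the second entry of (a, b). It is a simplified reference, not the cuBLAS routine: the helper name `rotg_reference` is hypothetical, the z value is omitted, and r is always taken as non-negative, whereas BLAS applies its own sign convention.

```python
import math


def rotg_reference(a, b):
    """Return (r, c, s) such that c*a + s*b == r and -s*a + c*b == 0."""
    r = math.hypot(a, b)  # sqrt(a**2 + b**2) without intermediate overflow
    if r == 0.0:
        return 0.0, 1.0, 0.0  # nothing to rotate: use the identity
    return r, a / r, b / r


r, c, s = rotg_reference(3.0, 4.0)
```

Applying the rotation with these values to the column vector (3, 4) gives (r, 0).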

accelerate.cuda.blas.Blas.rotm(x, y, param)

Applies the modified Givens transformation inplace.

Same as:

param = flag, h11, h21, h12, h22
x[i], y[i] = h11 * x[i] + h12 * y[i], h21 * x[i] + h22 * y[i]

Refer to the cuBLAS documentation for the use of flag.

Parameters:
  • x (python.array) – vector
  • y (python.array) – vector
  • param – 1D array holding flag, h11, h21, h12, h22 (for example, as returned by rotmg)
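The update can be sketched in NumPy for the case where all four h entries are taken from param. The helper `rotm_reference` is hypothetical, and the assumption that flag == -1 selects this full-matrix case comes from the standard BLAS convention; check the cuBLAS documentation for the other flag values, which this sketch does not handle.

```python
import numpy as np


def rotm_reference(x, y, param):
    """Apply H = [[h11, h12], [h21, h22]] to every pair (x[i], y[i])."""
    flag, h11, h21, h12, h22 = param
    if flag != -1.0:
        # Assumption: only the full-matrix flag is sketched here.
        raise NotImplementedError("only flag == -1 is covered by this sketch")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Both results use the original x and y, not partially updated values.
    return h11 * x + h12 * y, h21 * x + h22 * y


# param order is flag, h11, h21, h12, h22
x_new, y_new = rotm_reference([1.0], [1.0], [-1.0, 1.0, 3.0, 2.0, 4.0])
```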
accelerate.cuda.blas.Blas.rotmg(d1, d2, x1, y1)

Constructs the modified Givens transformation H that zeros out the second entry of a column vector (d1 * x1, d2 * y1).

Parameters:
  • d1 – scaling factor for the x-coordinate of the input vector
  • d2 – scaling factor for the y-coordinate of the input vector
  • x1 – x-coordinate of the input vector
  • y1 – y-coordinate of the input vector
Returns:

A 1D array that is usable in rotm. The first element is the flag for rotm. The remaining elements correspond to the h11, h21, h12, h22 elements of H.

BLAS Level 2

All level 2 routines use the following naming convention for their arguments:

  • A, B, C, AP – (2D array) Matrix argument.

    AP denotes a triangular, symmetric, or Hermitian matrix in packed storage.

  • x, y, z – (1D arrays) Vector argument.

  • alpha, beta – (scalar) Can be floats or complex numbers, depending on the dtype of the arrays.

  • m – (scalar) Number of rows of matrix A.

  • n – (scalar) Number of columns of matrix A. If m is not needed, n also means the number of rows of matrix A, which implies a square matrix.

  • trans, transa, transb – (string)

    Select the operation op to apply to a matrix:

    • ‘N’: op(X) = X, the identity operation;
    • ‘T’: op(X) = X**T, the transpose;
    • ‘C’: op(X) = X**H, the conjugate transpose.

    trans is used when the routine takes a single matrix argument. transa and transb apply to matrix A and matrix B, respectively.

  • uplo – (string) ‘U’ if the upper triangular part of the matrix is filled; ‘L’ if the lower triangular part is filled.

  • diag – (boolean) Whether the matrix diagonal has unit elements.

  • side – (string) ‘L’ means the matrix is on the left side in the equation; ‘R’ means the matrix is on the right side in the equation.

Note

The last array argument is always overwritten with the result.
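The three trans codes listed above map directly onto NumPy operations. The helper below is a hypothetical reference written for illustration; it is not part of the binding.

```python
import numpy as np


def op(X, trans):
    """Return op(X) for trans in {'N', 'T', 'C'}."""
    X = np.asarray(X)
    if trans == 'N':
        return X            # identity
    if trans == 'T':
        return X.T          # transpose
    if trans == 'C':
        return X.conj().T   # conjugate transpose
    raise ValueError("trans must be 'N', 'T' or 'C'")


X = np.array([[1 + 2j, 3], [4, 5 - 1j]])
```

For real matrices, 'T' and 'C' give the same result.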

accelerate.cuda.blas.Blas.gbmv(trans, m, n, kl, ku, alpha, A, x, beta, y)

banded matrix-vector multiplication y = alpha * op(A) * x + beta * y where A has kl sub-diagonals and ku super-diagonals.

accelerate.cuda.blas.Blas.gemv(trans, m, n, alpha, A, x, beta, y)

matrix-vector multiplication y = alpha * op(A) * x + beta * y
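A NumPy reference for this formula is sketched below. The name `gemv_reference` is hypothetical, and m and n are dropped because NumPy arrays carry their own shape. The sketch overwrites y to mirror the note above that the last array argument receives the result.

```python
import numpy as np


def gemv_reference(trans, alpha, A, x, beta, y):
    """Overwrite y with alpha * op(A) @ x + beta * y and return it."""
    op_a = {'N': A, 'T': A.T, 'C': A.conj().T}[trans]
    y[:] = alpha * (op_a @ x) + beta * y
    return y


A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
y_n = gemv_reference('N', 2.0, A, x, 3.0, np.array([1.0, 1.0]))
y_t = gemv_reference('T', 2.0, A, x, 3.0, np.array([1.0, 1.0]))
```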

accelerate.cuda.blas.Blas.trmv(uplo, trans, diag, n, A, x)

triangular matrix-vector multiplication x = op(A) * x

accelerate.cuda.blas.Blas.tbmv(uplo, trans, diag, n, k, A, x)

triangular banded matrix-vector x = op(A) * x

accelerate.cuda.blas.Blas.tpmv(uplo, trans, diag, n, AP, x)

triangular packed matrix-vector multiplication x = op(A) * x

accelerate.cuda.blas.Blas.trsv(uplo, trans, diag, n, A, x)

Solves the triangular linear system with a single right-hand-side. op(A) * x = b
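To illustrate how uplo and diag select the system being solved, here is a NumPy reference sketch. The helper `trsv_reference` is hypothetical, and it assumes (following the usual BLAS behaviour) that only the triangle named by uplo is read and that diag=True treats the diagonal as all ones. It returns the solution rather than overwriting x.

```python
import numpy as np


def trsv_reference(uplo, trans, diag, A, b):
    """Solve op(T) @ x = b, where T is the uplo triangle of A."""
    # np.triu / np.tril return copies, so A itself is left untouched.
    T = np.triu(A) if uplo == 'U' else np.tril(A)
    if diag:
        np.fill_diagonal(T, 1.0)  # unit diagonal: stored values are ignored
    op_t = {'N': T, 'T': T.T, 'C': T.conj().T}[trans]
    return np.linalg.solve(op_t, b)


# The 99.0 below the diagonal is ignored because uplo == 'U'.
A = np.array([[2.0, 1.0], [99.0, 4.0]])
x_upper = trsv_reference('U', 'N', False, A, np.array([4.0, 8.0]))
x_unit = trsv_reference('U', 'N', True, A, np.array([3.0, 2.0]))
```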

accelerate.cuda.blas.Blas.tpsv(uplo, trans, diag, n, AP, x)

Solves the packed triangular linear system with a single right-hand-side. op(A) * x = b

accelerate.cuda.blas.Blas.tbsv(uplo, trans, diag, n, k, A, x)

Solves the triangular banded linear system with a single right-hand-side. op(A) * x = b

accelerate.cuda.blas.Blas.symv(uplo, n, alpha, A, x, beta, y)

symmetric matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.hemv(uplo, n, alpha, A, x, beta, y)

Hermitian matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.sbmv(uplo, n, k, alpha, A, x, beta, y)

symmetric banded matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.hbmv(uplo, n, k, alpha, A, x, beta, y)

Hermitian banded matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.spmv(uplo, n, alpha, AP, x, beta, y)

symmetric packed matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.hpmv(uplo, n, alpha, AP, x, beta, y)

Hermitian packed matrix-vector multiplication y = alpha * A * x + beta * y

accelerate.cuda.blas.Blas.ger(m, n, alpha, x, y, A)

the rank-1 update A := alpha * x * y ** T + A

accelerate.cuda.blas.Blas.geru(m, n, alpha, x, y, A)

the rank-1 update A := alpha * x * y ** T + A

accelerate.cuda.blas.Blas.gerc(m, n, alpha, x, y, A)

the rank-1 update A := alpha * x * y ** H + A
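The difference between the transposed (ger, geru) and conjugated (gerc) rank-1 updates can be seen in a small NumPy reference. The helper `ger_reference` and its `conjugate` switch are hypothetical names used only for this sketch.

```python
import numpy as np


def ger_reference(alpha, x, y, A, conjugate=False):
    """Update A in place with alpha * x * y**T (or y**H if conjugate)."""
    y_eff = np.conj(y) if conjugate else y
    A += alpha * np.outer(x, y_eff)  # outer(x, y)[i, j] == x[i] * y[j]
    return A


x = np.array([1.0, 2.0], dtype=complex)
y = np.array([1j, 1.0], dtype=complex)
A_u = ger_reference(1.0, x, y, np.zeros((2, 2), dtype=complex))
A_c = ger_reference(1.0, x, y, np.zeros((2, 2), dtype=complex), conjugate=True)
```

For real vectors the two variants coincide.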

accelerate.cuda.blas.Blas.syr(uplo, n, alpha, x, A)

symmetric rank 1 operation A := alpha * x * x ** T + A

accelerate.cuda.blas.Blas.her(uplo, n, alpha, x, A)

hermitian rank 1 operation A := alpha * x * x ** H + A

accelerate.cuda.blas.Blas.spr(uplo, n, alpha, x, AP)

the symmetric rank 1 operation A := alpha * x * x ** T + A

accelerate.cuda.blas.Blas.hpr(uplo, n, alpha, x, AP)

hermitian rank 1 operation A := alpha * x * x ** H + A

accelerate.cuda.blas.Blas.syr2(uplo, n, alpha, x, y, A)

symmetric rank-2 update A = alpha * (x * y ** T + y * x ** T) + A

accelerate.cuda.blas.Blas.her2(uplo, n, alpha, x, y, A)

Hermitian rank-2 update A = alpha * x * y ** H + conj(alpha) * y * x ** H + A

accelerate.cuda.blas.Blas.spr2(uplo, n, alpha, x, y, A)

packed symmetric rank-2 update A = alpha * (x * y ** T + y * x ** T) + A

accelerate.cuda.blas.Blas.hpr2(uplo, n, alpha, x, y, A)

packed Hermitian rank-2 update A = alpha * x * y ** H + conj(alpha) * y * x ** H + A

BLAS Level 3

All level 3 routines follow the same naming convention for arguments as in level 2 routines.

accelerate.cuda.blas.Blas.gemm(transa, transb, m, n, k, alpha, A, B, beta, C)

matrix-matrix multiplication C = alpha * op(A) * op(B) + beta * C
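The same formula written as a NumPy reference, as a sketch for checking results on small inputs. The name `gemm_reference` is hypothetical; m, n and k are omitted because they follow from the array shapes.

```python
import numpy as np


def gemm_reference(transa, transb, alpha, A, B, beta, C):
    """Overwrite C with alpha * op(A) @ op(B) + beta * C and return it."""
    def op(X, trans):
        return {'N': X, 'T': X.T, 'C': X.conj().T}[trans]

    C[:] = alpha * (op(A, transa) @ op(B, transb)) + beta * C
    return C


A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # 2 x 3
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 x 2
C = gemm_reference('N', 'N', 2.0, A, B, 1.0, np.ones((2, 2)))
# Passing the stored transpose together with transa='T' gives the same product.
C_t = gemm_reference('T', 'N', 2.0, A.T.copy(), B, 1.0, np.ones((2, 2)))
```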

accelerate.cuda.blas.Blas.syrk(uplo, trans, n, k, alpha, A, beta, C)

symmetric rank-k update C = alpha * op(A) * op(A) ** T + beta * C

accelerate.cuda.blas.Blas.herk(uplo, trans, n, k, alpha, A, beta, C)

Hermitian rank-k update C = alpha * op(A) * op(A) ** H + beta * C

accelerate.cuda.blas.Blas.symm(side, uplo, m, n, alpha, A, B, beta, C)

symmetric matrix-matrix multiplication:

if  side == 'L':
    C = alpha * A * B + beta * C
else:  # side == 'R'
    C = alpha * B * A + beta * C
accelerate.cuda.blas.Blas.hemm(side, uplo, m, n, alpha, A, B, beta, C)

Hermitian matrix-matrix multiplication:

if  side == 'L':
    C = alpha * A * B + beta * C
else:   #  side == 'R':
    C = alpha * B * A + beta * C
accelerate.cuda.blas.Blas.trsm(side, uplo, trans, diag, m, n, alpha, A, B)

Solves the triangular linear system with multiple right-hand-sides:

if  side == 'L':
    op(A) * X = alpha * B
else:       # side == 'R'
    X * op(A) = alpha * B
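The effect of side can be sketched with a general NumPy solver. This is a simplified reference under stated assumptions: `trsm_reference` is a hypothetical helper, op(A) is passed in already formed, and a dense solve is used where cuBLAS would exploit the triangular structure.

```python
import numpy as np


def trsm_reference(side, alpha, op_a, B):
    """Return X with op_a @ X = alpha * B ('L') or X @ op_a = alpha * B ('R')."""
    if side == 'L':
        return np.linalg.solve(op_a, alpha * B)
    # X @ op_a = alpha * B  is equivalent to  op_a.T @ X.T = alpha * B.T
    return np.linalg.solve(op_a.T, (alpha * B).T).T


A = np.array([[2.0, 0.0], [1.0, 4.0]])  # lower triangular
B = np.array([[2.0, 4.0], [9.0, 6.0]])
X_left = trsm_reference('L', 2.0, A, B)
X_right = trsm_reference('R', 2.0, A, B)
```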
accelerate.cuda.blas.Blas.trmm(side, uplo, trans, diag, m, n, alpha, A, B, C)

triangular matrix-matrix multiplication:

if side == 'L':
    C = alpha * op(A) * B
else:   # side == 'R'
    C = alpha * B * op(A)
accelerate.cuda.blas.Blas.dgmm(side, m, n, A, x, C)

matrix-matrix multiplication:

if side == 'R':
    C = A * diag(x)
else:       # side == 'L'
    C = diag(x) * A
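Multiplying by a diagonal matrix amounts to scaling columns or rows, which the NumPy sketch below makes explicit. The helper `dgmm_reference` is hypothetical, and it assumes the usual cuBLAS meaning of the operation: A times diag(x) for 'R' and diag(x) times A for 'L'.

```python
import numpy as np


def dgmm_reference(side, A, x):
    """Return A @ diag(x) for side 'R', or diag(x) @ A for side 'L'."""
    A = np.asarray(A)
    x = np.asarray(x)
    if side == 'R':
        return A * x           # scales column j of A by x[j]
    return x[:, None] * A      # scales row i of A by x[i]


A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([10.0, 100.0])
C_right = dgmm_reference('R', A, x)
C_left = dgmm_reference('L', A, x)
```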
accelerate.cuda.blas.Blas.geam(transa, transb, m, n, alpha, A, beta, B, C)

matrix-matrix addition/transposition C = alpha * op(A) + beta * op(B)
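One use of this formula is an out-of-place transpose, obtained with alpha = 1, beta = 0 and transa = 'T'. The NumPy reference below is a sketch with a hypothetical helper name, `geam_reference`, and returns a new array instead of writing into C.

```python
import numpy as np


def geam_reference(transa, transb, alpha, A, beta, B):
    """Return alpha * op(A) + beta * op(B)."""
    def op(X, trans):
        return {'N': X, 'T': X.T, 'C': X.conj().T}[trans]

    return alpha * op(A, transa) + beta * op(B, transb)


A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.zeros((2, 2))
A_transposed = geam_reference('T', 'N', 1.0, A, 0.0, B)
A_plus_At = geam_reference('N', 'T', 1.0, A, 1.0, A)
```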