CUDA Programming—Wolfram Documentation

Wolfram Language & System Documentation Center

CUDA Programming

Introduction	CUDA Kernels
CUDA Support	Advanced Operations

Introduction

CUDA is a parallel computing architecture for Graphics Processing Units (GPUs). It contains functions that use CUDA-enabled GPUs to boost performance in a number of areas. This is particularly relevant for applying related functions to large blocks of data that can be done in parallel. In recent years, this has become particularly useful for machine learning applications.

CUDA functionality requires that a CUDA-enabled GPU be present. In addition, for many features, extra software may be required. If no CUDA GPU is present, CUDA programming cannot work. It should be noted that CUDA functionality is available on Windows and Linux machines but not on macOS machines.

The Wolfram Language makes use of CUDA to accelerate a number of features where this is possible and advantageous. As is typical, the Wolfram Language features attempt to make use of the functionality as smoothly as possible.

The Wolfram Language also provides access to develop and execute CUDA programs on CUDA hardware. This includes functionality built around the ExternalEvaluate framework and the Wolfram Compiler. Working with CUDA programming from the Wolfram Language allows interesting and unique solutions to be developed. It is also a very powerful way to learn and explore the technology.

Basics

CUDA programming basics are quite simple. The CUDA-enabled GPU works as a separate computation device with its own memory and its own processors that coexists with a host computer. In contrast with a modern computer, which may have up to tens of cores, a CUDA GPU can have up to thousands of cores.

The essence of CUDA programming involves loading data onto the CUDA device. Code, known as kernels, is then run on the data. Finally, the data is copied back from the device to the host computer to form results. Of course, there are many details involved in CUDA programming, and these will be covered in this tutorial.

GPUArray

GPUArray is a convenient Wolfram Language feature for working with data on a GPU. If it is used on a computer that has a suitable CUDA device, then it will automatically store data on the GPU. Later sections will cover how you can determine if there is a suitable CUDA device.

Data can be created using a Wolfram Language List. (Alternatively, you could use other functions such as Range, Table or ConstantArray to make data.)

The following is data created on the host computer:

Wolfram Language code: hostData = {0., 0.5, 1., 1.5, 2., 2.5, 3., 3.5}

If the system has a suitable CUDA GPU available, then GPUArray will copy the data to the GPU and hold a reference:

Wolfram Language code: gpuData = GPUArray[hostData]

The data can be retrieved from the GPU by using Normal:

Wolfram Language code: Normal[gpuData]

A GPUArray can also be created using NumericArray. This can be useful because it allows the type of the data to be set. The details of types and CUDA programming will be covered in a later section:

Wolfram Language code: data = GPUArray[NumericArray[Range[10], "Integer64"]]

One thing to note is that GPUArray will work even if there is no CUDA GPU. This is because its mission is to provide a platform- and technology-neutral interface to GPU operations. This means that if no GPU is available, data storage and operations all use the host machine. However, if a CUDA GPU is available, then GPUArray will use it.

GPUArray has a lot of nice functionality that is documented elsewhere. This tutorial uses GPUArray as a convenient way to work with data stored on a CUDA device.

ExternalEvaluate for CUDA

The Wolfram Language provides an external evaluation framework for working with external languages, and this includes working with CUDA. It provides a convenient and simple way for CUDA programming that helps avoid many of the initial difficulties that can make the technology hard to use, especially for new users.

To work with CUDA through external evaluation, functions to run on the CUDA hardware are required. These can be written in two ways: first, using the Wolfram Language; second, using C++. When code is written with the Wolfram Language, the Wolfram Compiler will be used to compile it to CUDA code.

The following is a Wolfram Language function that can run on a CUDA GPU. It has arguments that are marked up with type annotations containing low-level types that are also used for working with the Wolfram Compiler.

This function takes a "CArray" that represents a block of data stored on the GPU. It also takes a machine integer that represents the length of the data. The code uses functions to help identity which particular processor is being used (this is common for low-level parallel programming). The details of these are explained later.

The function then reads from the data, adds 5 and stores the value back:

Wolfram Language code:

func = Function[{Typed[x, "CArray"::["Integer64"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, FromRawPointer[ x, id] + 5]
	];
	]
	];

The function can then be compiled into an ExternalFunction object:

Wolfram Language code: gpuFun = ExternalEvaluate["CUDA", func]

The first time this executes, the available hardware and software will be tested. There may even be installation of software in a Wolfram Language paclet to facilitate operation. Details of configuration for CUDA programming and how to deal with problems are described in the section on CUDA support.

If an external function object is returned, then the function can be executed with suitable data such as a GPUArray:

Wolfram Language code: data = GPUArray[NumericArray[Range[10], "Integer64"]];

Now the function can be invoked by applying it to the arguments:

Wolfram Language code: gpuFun[data, 10]

The result was written into the array argument, as can be seen in the following:

Wolfram Language code: Normal[data]

It is also possible to use StartExternalSession to create a CUDA session, as in the following:

Wolfram Language code: session = StartExternalSession["CUDA"]

The session can be used with ExternalEvaluate to compile code to an ExternalFunction:

Wolfram Language code: gpuFun = ExternalEvaluate[session, func]

Now the function can be invoked by applying it to suitable arguments:

Wolfram Language code:

data = GPUArray[NumericArray[Range[10], "Integer64"]];
gpuFun[data, 10]

The result was written into the array argument, as can be seen in the following:

Wolfram Language code: Normal[data]

A CUDA session can be useful to track down problems in setting up for CUDA operations. These are investigated in the next section on CUDA support.

CUDA Support

To use CUDA functionality, there are various requirements that must be fulfilled. If ExternalEvaluate returns as expected, then operations should continue to succeed. However, if there are errors, then there needs to be some further configuration.

One convenient way to investigate these is to create a CUDA session with StartExternalSession. This will inspect the system on which it is running and attempt to find the necessary components. It will return a session object if CUDA functionality will work. As seen previously, this session object can be used in CUDA operations.

When trying to set up, a first check is made for CUDA drivers and the libraries that are bundled with the Wolfram Language. If these cannot be found or are unsupported, then a message and a failure are returned. As shown in the following section, if the platform does not support CUDA, this is reported.

If the hardware and drivers are present, a check is made for a CUDA toolkit. Wolfram Research provides this in a paclet called "WolframGPUSDK". If this paclet is installed, the toolkit it contains will be used. If no paclet is installed, the functionality looks for a viable CUDA toolkit in the system. If no installed CUDA toolkit is found, the system will download the "WolframGPUSDK" paclet.

When all of these requirements are fulfilled, StartExternalSession returns an ExternalSessionObject as in the following:

Wolfram Language code: StartExternalSession["CUDA"]

macOS

There is no support for CUDA functionality on the macOS platform:

Wolfram Language code: $OperatingSystem

StartExternalSession will issue a message and return a failure object:

Wolfram Language code: StartExternalSession["CUDA"]

CUDA SDK Paclet

The "WolframGPUSDK" paclet is provided by Wolfram Research to allow CUDA functionality to work in cases where a CUDA SDK cannot be found.

If the paclet is installed, then it will be used. Otherwise, the system searches for a viable CUDA toolkit. If this is not found, then the "WolframGPUSDK" paclet is installed.

You can install the paclet yourself with PacletInstall. The following uses PacletFind, which shows that in this system it has not been installed:

Wolfram Language code: PacletFind["WolframGPUSDK"]

PacletInstall installs the paclet; this is quite lengthy:

Wolfram Language code: PacletInstall["WolframGPUSDK"]

Now the paclet is found:

Wolfram Language code: PacletFind["WolframGPUSDK"]

As indicated previously, if the "WolframGPUSDK" paclet is installed, it will be used.

Verifying Support

If StartExternalSession returns a valid CUDA session, then CUDA operations should work. However, if it cannot verify support for CUDA functionality, it will return a failure object and action will need to be taken. In this case, various steps can be taken that provide more details than just running StartExternalSession.

A core way is to use CUDADeviceInformation, which returns information on CUDA devices. To use this, a package has to be loaded:

Wolfram Language code: Needs["CUDATools`"]

This gives information on CUDA devices that have been found:

Wolfram Language code: CUDADeviceInformation[]

Even more information can be obtained from the result of InstallCUDA. This is called automatically by the ExternalEvaluate framework, but is useful to learn about the files and system being used. It has to be accessed by loading a package:

Wolfram Language code: Needs["CUDALink`"]

If CUDA is supported, the result is a success object, and this contains detailed information:

Wolfram Language code: InstallCUDA[]

macOS

CUDADeviceInformation can also be run on macOS. It shows that there are no CUDA devices:

Wolfram Language code: Needs["CUDATools`"]

Wolfram Language code: CUDADeviceInformation[]

CUDA Kernels

CUDA functionality is based on executing functions called kernels on the CUDA GPU hardware. The GPU hardware provides a large number of processors to run the kernels. The system used in these examples has 5888 cores (as reported by CUDADeviceInformation). As is typical for performance computing, a key element is often less the speed of the cores running the code than the speed at which data can be moved from memory to the cores. CUDA GPUs typically have their own memory system that is distinct from the memory of the host system. The system used in these examples has about 8GB of GPU memory (as reported by CUDADeviceInformation).

This section of the documentation describes some details of CUDA kernels and how to work with them in the Wolfram Language.

Threads

A key aspect of how CUDA kernels are run is the arrangement of threads into grids. Parameters for this are set when the kernels are launched and used by the CUDA scheduler. In the code of a CUDA kernel, details of the thread can be determined by various intrinsic functions.

This document is not going to be a detailed description of the way that CUDA kernels run on the GPU. However, a little information is provided here. On the CUDA GPU, the same kernel is executed many times in threads. Threads are grouped into blocks and blocks are grouped into grids. A rough picture is shown below.

The grouping in both blocks and grids can be set to have one, two or three dimensions. As described previously, the grouping is set when the kernels are launched. When CUDA kernels are run with the Wolfram Language ExternalEvaluate system, parameters for the allocation of the thread are automatically chosen. This works well for simple problems such as running independent calculations over a block of data. In this case, a one-dimensional layout is used, which is particularly simple and easy to understand.

In a one-dimensional case, each thread can be globally identified by computing an ID with the following code in Wolfram Language:

LibraryFunction["BlockDimensions.x"][]*LibraryFunction["BlockID.x"][]+LibraryFunction["ThreadID.x"][]

and the following in C++:

blockIdx.x * blockDim.x + threadIdx.x

These are patterns of code that are commonly seen in CUDA kernels. Note that the intrinsics such as "ThreadID.x" identify the thread in each block. To identify a thread globally (compared with any other thread), a computation such as above is required.

Now some computations that demonstrate these are explored.

Here code is compiled into a CUDA kernel. It computes the global thread ID and saves this in the input data:

Wolfram Language code:

gpuFun = ExternalEvaluate["CUDA", Function[{Typed[x, "CArray"::["Integer32"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, id]
	];
	]
	]];

This creates some data and uses it to execute on the GPU. The data has a length of 2²⁰:

Wolfram Language code:

len = 2 ^ 20;
d = GPUArray[NumericArray[ConstantArray[0, {len}], "Integer32"]];
gpuFun[d, len]

This shows the first 20 elements:

Wolfram Language code: Take[Normal[d], 20]

This shows that the total number of threads is equal to the number of data elements:

Wolfram Language code: Normal[d] === Range[0, len - 1]

A different computation can be done by restricting the number of grids to 1 and the number of dimensions to 256. This means that only 256 threads can run:

Wolfram Language code:

len = 512;
d = GPUArray[NumericArray[ConstantArray[0, {len}], "Integer32"]];
gpuFun[d, len, "GridDimensions" -> 1, "BlockDimensions" -> 256]

This shows that only 256 threads were run:

Wolfram Language code: Normal[d]

More insight can be obtained by storing the ID of each thread in a block in the output:

Wolfram Language code:

gpuFun = ExternalEvaluate["CUDA", Function[{Typed[x, "CArray"::["Integer32"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, LibraryFunction["ThreadID.x"][]]
	];
	]
	]];

This launches the kernel:

Wolfram Language code:

len = 8192;
data = GPUArray[NumericArray[ConstantArray[0, len], "Integer32"]];
gpuFun[data, len]

You can see that the maximum number of each thread in the block is 255; this is because the block dimensions are set to 256:

Wolfram Language code: Take[Normal[data], 500]

Now run the same calculation but only using 32 threads in each block:

Wolfram Language code:

len = 8192;
data = GPUArray[NumericArray[ConstantArray[0, len], "Integer32"]];
gpuFun[data, len, "GridDimensions" -> Quotient[len, 32], "BlockDimensions" -> 32]

This shows the maximum ID of each thread in a block is 31. This is because the block dimensions were set to 32:

Wolfram Language code: Take[Normal[data], 500]

More computations and experiments can be done for two- and three-dimensional setups. This shows the benefits of using the Wolfram Language interface to CUDA programming to learn and understand how the technology works.

Arguments

CUDA kernels are functions that execute in a highly parallel manner on a CUDA GPU. Kernels work by manipulating the memory stored on the GPU and return no values. Therefore, the arguments to a CUDA kernel must include one reference to a block of memory on the GPU.

The Wolfram Language tools for working with CUDA kernels allow you to create them from Wolfram Language or C++ programs. This section discusses the types you can use to write CUDA kernels and how you can create data with which to invoke them.

Wolfram Language Types

The ExternalEvaluate "CUDA" system allows you to create CUDA kernels from Wolfram Language functions. A typical function is shown in the following:

Wolfram Language code:

Function[{Typed[x, "CArray"::["Integer64"]], Typed[n, "Integer64"]}, Module[{id}, id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];If[id < n, ToRawPointer[ x, id, FromRawPointer[ x, id] + 5]];]]

The arguments to the function use Typed to denote the type with which the function should be called. The type is also used by the Wolfram Compiler to determine the types of all the parts of the code it compiles.

For CUDA kernels, the types can only be a limited set of raw types and are described here. In the example above, the type of the x argument is given as "CArray"::["Integer64"], which is a reference to a block of 64-bit integers stored in memory on the GPU. The n argument is a 64-bit integer. Note that there is no return value.

The CUDA kernel is launched with arguments. For scalar arguments such as integers, these are converted from compatible Wolfram Language expressions. For arguments that reference blocks of memory, these are converted from Wolfram Language expressions that hold data on a CUDA GPU. An example of this is GPUArray.

C++ Types

The ExternalEvaluate "CUDA" system allows you to create CUDA kernels from C++ code. A typical function is shown in the following:

Wolfram Language code:

"__global__ void addFun(double *x, int n) {
    const int tid =  blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        x[tid] = x[tid] + 5;
    }
}"

For CUDA kernels, the types can only be a limited set of raw types and are described here. In the example above, the type of the x argument is given as double*, which is a reference to a block of double-precision values stored in memory on the GPU. The n argument is an int, which will typically be a 32-bit integer. Note that there is no return value.

Creating Data

When CUDA kernels are executed, they need arguments. Scalar arguments such as integers are easily converted from Wolfram Language expressions. However, there must be at least one argument that references data stored in memory on the GPU. Typically, this is created by GPUArray.

GPUArray was introduced above as a convenient way to work with Wolfram Language data on a CUDA GPU. While it will work in a sensible fashion on systems that do not support CUDA, it will always store its data on a CUDA GPU if one is available.

Data can be created using a Wolfram Language List. (Alternatively, you could use other functions such as Range, Table or ConstantArray to make data.):

Wolfram Language code: hostData = {0., 0.5, 1., 1.5, 2., 2.5, 3., 3.5}

If the system has a suitable CUDA GPU available, then GPUArray will copy the data to the GPU and hold a reference:

Wolfram Language code: gpuData = GPUArray[hostData]

The data can be retrieved from the GPU by using Normal:

Wolfram Language code: Normal[gpuData]

A GPUArray can also be created using NumericArray:

Wolfram Language code: data = GPUArray[NumericArray[Range[10], "Integer32"]]

This can be useful because it allows the type of the data to be set. This is important in order to match the type of any kernel being called and has been used extensively in this tutorial.

Creating Kernels

CUDA kernels are created by compilation. The source can be written in the Wolfram Language or in CUDA C++. There are several advantages of using the Wolfram Language, but one that stands out is that functions can be created by other Wolfram Language code. When C++ is used to create CUDA kernels, this makes use of the standard Nvidia tools for processing C++, so all features of kernels are available.

ExternalEvaluate is a simple way to create kernels, and a number of examples have been seen in this document. However, it is also possible to use ExternalOperation and gain more control over the creation process. This is explored in the later sections.

Here is an example of compiling the Wolfram Language with ExternalEvaluate:

Wolfram Language code:

gpuFun = ExternalEvaluate["CUDA", Function[{Typed[x, "CArray"::["Integer64"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, FromRawPointer[ x, id] + 5]
	];
	]
	]]

This can be called by passing arguments:

Wolfram Language code:

data = GPUArray[NumericArray[Range[10], "Integer64"]];
gpuFun[data, 10];
data//Normal

CUDA kernels can also be created from C++. This can be given either as a string or as a file. The following creates a kernel from a string of C++:

Wolfram Language code:

code = "#include \"WolframLibrary.h\"
__global__ void addFun(double *x, mint N) {
    const int tid =  blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        x[tid] = x[tid] + 5;
    }
}";
gpuFun = ExternalEvaluate["CUDA", code]

ExternalOperation Function

ExternalOperation with operation "Function" can be used to get more control over the creation process, for example, passing in more declarations.

An example is shown by the following that uses FunctionDeclaration. First, here is the declaration:

Wolfram Language code: decl = FunctionDeclaration[increment, Typed[{"Real64"} -> "Real64"]@Function[arg, arg + 1]];

Now, here is a Wolfram Language function that uses the declaration:

Wolfram Language code:

func = Function[{Typed[x, "CArray"::["Real64"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, increment[FromRawPointer[ x, id]]]
	];
	]
	];

Compilation can be done with the ExternalOperation "Function", using both the declaration and a function:

Wolfram Language code: gpuFun = ExternalEvaluate["CUDA", ExternalOperation["Function", decl, func]]

ExternalOperation is also useful because options to control the compilation can be passed. In the following, an option to the Nvidia compiler is given and also an option that prints out the command being used:

Wolfram Language code:

code = "#include \"WolframLibrary.h\"
__global__ void addFun(double *x, mint N) {
    const int tid =  blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        x[tid] = x[tid] + 5;
    }
}";
ExternalEvaluate["CUDA", 
	ExternalOperation["Function", code
	, 
	"CompileOptions" -> {"--generate-line-info"}
	, 
	"ShellCommandFunction" -> Print
	]
	]

ExternalOperation Program

ExternalOperation with operation "Program" can be used to compile several functions at once.

Working with code in the Wolfram Language, several functions can be given in an association as follows:

Wolfram Language code:

funs = <|"add" -> Function[{Typed[x, "CArray"::["Real64"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, FromRawPointer[ x, id] + 1]
	];
	]
	], "sub" -> Function[{Typed[x, "CArray"::["Real64"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, FromRawPointer[ x, id]11]
	];
	]
	]|>;

Compilation of all the functions can be done with a program as follows:

Wolfram Language code: gpuFuns = ExternalEvaluate["CUDA", ExternalOperation["Program", funs]]

The functions can be extracted from the ExternalObject result using the "Functions" property:

Wolfram Language code: gpuFuns["Functions"]

The ExternalObject result has other properties that are useful. This can be demonstrated by compiling code that contains more than one kernel:

Wolfram Language code:

file = File[FileNameJoin[{PacletObject["CUDALink"]["Location"], "CUDACode", "arithmetic_kernels.cu"}]];
FilePrint[file]

Here the source file is compiled:

Wolfram Language code: gpuObj = ExternalEvaluate["CUDA", ExternalOperation["Program", file]]

The "Functions" property can be used to extract the actual functions:

Wolfram Language code: gpuObj["Functions"]

However, it is possible that the functions have the same name, in which case, using a property that returns the kernels labeled by their mangled name might be useful:

Wolfram Language code: gpuObj["FunctionsByMangledName"]

Another useful property is "Kernels", which returns "CUDAKernel" objects. These are the lowest-level representation and have some advantages that are described later. You can see this as follows:

Wolfram Language code: gpuObj["Kernels"]

Advanced Operations

This section describes advanced ways to use the CUDA functionality that give more flexibility and better performance. They do this by avoiding the higher-level function of ExternalEvaluate and GPUArray. These are useful but come with some overhead.

CUDAVector

"CUDAVector" objects are used to hold data on CUDA hardware. Unlike GPUArray, they only store data on CUDA hardware; they also have very low overhead. To use them, a package of CUDA functionality has to be loaded as follows:

Wolfram Language code: Needs["CUDALink`"]

Now a vector of data stored on the hardware is made as follows:

Wolfram Language code:

data = Range[0., 2., 0.1];
cuVec = CUDAVector[data, "Real32"]

They take an argument that describes the type of data to be created and supports a wide range of types. The following creates a vector of "Real16":

Wolfram Language code: cuVec = CUDAVector[data, "Real16"]

They can be used as arguments to CUDA kernels in the place of GPUArray.

The length of the data can be determined:

Wolfram Language code: Length[cuVec]

To extract the data, you should use Normal:

Wolfram Language code: na = Normal[cuVec]

The result is a NumericArray that preserves the exact type representation. To see the data in a way to use as an expression, you can use Normal again:

Wolfram Language code: Normal[na]

These objects are frequently used to interact with CUDA.

CUDAKernel

"CUDAKernel" objects are used to interact with CUDA hardware, but they are typically contained inside other objects. To work with them, you first compile code to an ExternalObject as follows:

Wolfram Language code:

gpuObj = ExternalEvaluate["CUDA", ExternalOperation["Program", Function[{Typed[x, "CArray"::["Real32"]], Typed[n, "MachineInteger"]}, 
	Module[{id}, 
	id = LibraryFunction["BlockDimensions.x"][] * LibraryFunction["BlockID.x"][] + LibraryFunction["ThreadID.x"][];
	If[id < n, 
	ToRawPointer[ x, id, Sin[FromRawPointer[ x, id]]]
	];
	]
	]]]

One way to work with the kernel objects is with the "Functions" property:

Wolfram Language code: gpuObj["Functions"]

However, it is possible to obtain the actual "CUDAKernel" objects using the "Kernels" property:

Wolfram Language code: gpuObj["Kernels"]

The result is an association of name/kernel pairs. This extracts an individual kernel:

Wolfram Language code: cuKern = gpuObj["Kernels"]["Main"]

This can be run on the hardware. This uses a "CUDAVector":

Wolfram Language code:

data = Range[0., 2., 0.1];
cuVec = CUDAVector[data, "Real32"]

The kernel can then be launched on the hardware:

Wolfram Language code: cuKern["Launch", cuVec, Length[cuVec]]

The results were written in the input array and can be seen as follows:

Wolfram Language code: Normal[Normal[cuVec]]

This is the most efficient way to execute code on CUDA hardware.

Top

More Learning

Tech Support

Wolfram Solutions

Wolfram Solutions For Education

Get Started

Grow Your Skills

Work with Us

Educational Programs for Adults

Educational Programs for Youth

Read

CUDA Programming

Basics

GPUArray

ExternalEvaluate for CUDA

macOS

CUDA SDK Paclet

Verifying Support

macOS

Threads

Arguments

Wolfram Language Types

C++ Types

Creating Data

Creating Kernels

ExternalOperation Function

ExternalOperation Program

CUDAVector

CUDAKernel