1D, 2D and 3D thread allocation for loops in CUDA

7 min readAug 26, 2018

In a recent optimization project that I worked on, I had to parallelize nested for loops using CUDA. Then the issue was getting nested for loop indices using thread indices. Here, I am going to discuss different ways of doing thread allocation to satisfy the requirement of getting the indices of the loops using thread indices. For that I assume that you know how to configure and run a CUDA program. Here I focus on different thread allocations. Let’s begin.

In order to get the loop indices from thread ids and block ids, you should first have a good understanding about what is a thread block, thread grid, thread id, block id and block dimension etc.

What is a thread block?

One thread block consists of set of threads. Those threads may be in 1D, 2D or 3D. When we consider a thread block, threadIdx and blockDim standard variables in CUDA can be considered very important.

threadIdx = Used to access the index of a thread inside a thread block

threadIdx.x = Index of a thread inside a block in X direction
threadIdx.y = Index of a thread inside a block in Y direction
threadIdx.z = Index of a thread inside a block in Z direction

blockDim = Number of threads in the block for a specific direction

blockDim.x = Number of threads in the block for X direction
blockDim.y = Number of threads in the block for Y direction
blockDim.z = Number of threads in the block for Z direction

Let’s see how those are being used in the context.

For thread 0, threadIdx.x = threadIdx.y = threadIdx.z = 0. But for the thread 3, threadIdx.x = 4 and threadIdx.y = threadIdx.z = 0. And blockDim.x = 5 and blockDim.y = blockDim.z = 1.

For thread 1, threadIdx.x = threadIdx.y = threadIdx.z = 0. For thread 6, threadIdx.x = 2, threadIdx.y = 1 and threadIdx.z = 0. And also blockDim.x=3 and blockDim.y=3.

Here, thread block is a cuboid of threads. Hope you will be able to imagine the situation. This is nothing but threads in all x, y and z directions.

Now, what is a thread grid?

Similar to before, thread grid is a set of thread blocks. Blocks also can be in 1D, 2D or 3D (Imagine replacing threads by thread blocks in the previous clarification for thread blocks). When it comes to thread grid, following variables are important.

blockIdx = Used to access an index of a thread block inside a thread grid

blockIdx.x = Index of a tread block in X direction
blockIdx.y = Index of a tread block in Y direction
blockIdx.z = Index of a tread block in Z direction

gridDim = Number of thread blocks in a specific direction.

gridDim.x = Number of thread blocks in X direction
gridDim.y = Number of thread blocks in Y direction
gridDim.z = Number of thread blocks in Z direction

When the kernel is initiated, we can set the dimensions for thread blocks and thread grids. We will see that later.

One for loop

Let’s start with one for loop. Our target is to get the i value inside the CUDA kernel. Here is the code for that.

Here is the output of the program.

It is not guarantee that the order of i will be same as for the CPU one. Grid dimension is set to (1, 1, 1). That means only one thread block is created. What is the dimension of the thread block? That has been specified as (4, 1, 1). This means 4 threads are created in x direction. Therefore,

blockDim.x = 4, blockDim.y = blockDim.z =1.
blockIdx.x = blockIdx.y = blockIdx.z = 0.
threadIdx.y = threadIdx.z = 0, threadIdx.x varies 0–3 (inclusive).

How I have taken the i value? I have added the thread index for X direction to the multiplication of block index in X direction and block dimension for X direction.

Two nested for loops

Now we can move to nested for loops. Let’s consider two loops. Inside the innermost loop, what we have are the combinations of i and j.

Here I have considered of using a grid(1, 1, 1) with only one two dimensional block(4, 4, 1). That means, in this configuration, threadIdx.x and threadIdx.y vary from 0 to 3. When the kernel gets executed 16 threads (4 * 4) will be invoked in a two dimensional thread block.

We can achieve the same thing by increasing the number of blocks in the 2D space as well. Following shows that.

Here I have created two blocks (that means dimension of grid is set to (2, 2, 1)). Therefore, in this approach,

threadIdx.x and threadIdx.y vary from 0 to 1.
blockIdx.x and blockIdx.y also vary from 0 to 1.

For example, how i = 3 and j = 2 are taken is,

For i,

threadIdx.x = 1, blockIdx.x = 1 and blockDim.x = 2. Then i = 3{1 + 1 * 2}.

For j,

threadIdx.y = 0, blockIdx.y = 1 and blockDim.y = 2. Then j = 2{0 + 1 * 2}.

Three nested for loops

Let’s look at representing three nested for loops in 3D space.

Same as before, we can use two blocks instead of one block.

Here I have used 8 thread blocks of size (1, 1, 1). But i, j and k values are taken same as before. This is very important because we can not allocate threads for a block more than a certain limit (1024 threads per block). If we want threads more than the limit, we can increase the number of blocks like above.

In all the above approaches, when there is only one for loop, we allocated threads in one direction(X) only. When there is a two nested for loop, we allocated threads in X and Y dimensions. When there is a three nested for loop, we allocated threads in all the three dimensions(X, Y and Z). But when it comes to 2D or 3D, we can consider of allocating threads in only one dimension in GPU. Let’s look in to that.

Consider the following loops.

As you can see, both gives the same i and j values. That means we can represent nested loops using only one dimension. Because one for loop can be represented by one dimension in CUDA.

Following shows the same thing with three nested for loops.

Let’s look at an example of how to do that with a CUDA kernel.

In above approach, only one block is created with all the required number of threads in x direction. But there is limit (1024) for number of threads in a block. Therefore, you may have to create set of blocks when the number of threads required is higher than the limit. In the following code we can set the number of threads per block. I have set it to 1024.

This works for any value of limitX, limitY and limitZ. Since it is possible to create threads more than required in this approach, we have to check whether threadId will go beyond our limit because otherwise it is possible that we may get values for i, j and k more than required.

I think I have covered different approaches of thread allocation and getting loop indices using thread indices and block indices. Just try those yourself and get a better understanding of how those are being done. I hope my code examples will be very helpful for that.

Happy coding 😃

1D, 2D and 3D thread allocation for loops in CUDA

One for loop

Two nested for loops

Three nested for loops

Written by Eranga Dulshan

Responses (7)