TODO: Fix formatting of body content.

Here at Imaginea Labs, we are exploring the true power of gpu-computing by analysing its way of launching a kernel, running 1000’s of thread blocks simultaneously, threads doing memory accesses in terms of global/shared and the optimized way to do so, doing atomic operation in threads when needed, optimizing threads to run more efficiently and faster.

In this post, we talk about cuda architecture and various experiments done on a use case with a brute-force approach to test and explore the gpu computation limits. We've had modest success in bringing out the best of gpu and faced some intriguing situations and results along the way.

Table of contents

  1. Introduction
  2. Device Specification
  3. Programming Flow
  4. Launching a Kernel
  5. Launch Parameters
  6. Memory Access
  7. Launch Process
  8. The Problem Statement
  9. The Solutions
  15. Learnings
  16. Conclusion
  17. References


GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications. Parallel algorithms running on GPUs can often achieve up to 100x speedup over similar CPU algorithms, with many existing applications for physics simulations, signal processing, financial modeling, neural networks, and countless other fields.

Cuda is a platform to interact with GPU also called DEVICE layer using majorly C/C++ language.

Unlike CPU(HOST), this is fast because of its multi threaded architecture which can run these threads in parallel and synchronize (when needed) manner. This is mainly used when large number of independent tasks to be performed. For example, texture manipulation where each pixel for the next frame is calculated independently.

Device Specification

NVIDIA GTX: 1080 Ti: SM(Streaming Processor) = 28, Cores OR SP(Stream Processor) =  3584, Cores/SM = 128, ThreadBlock(blocks) = 32 threads warp split, Concurrent Blocks/SM = 32, Concurrent warps/SM = 64.

Programming Flow

The basic flow in for any gpu computing technique follows three steps:
Data preparation: Load data into cpu memory and copy data to gpu memory.
Data computation: Launch the kernel(function to be executed on gpu) to do the required computation on the element of interest in data loaded in gpu.
Data retrieving: The computation results are saved in gpu memory, so copy the data back to cpu memory and use it however is needed.

Launching a Kernel

Kernel is just like any other function. This kernel is executed on gpu if __ global __ prefix is added to the function. Unlike normal function which does the computation for whole data to finish all task, the kernel would written to do computation for only one independent task. There is a very slight difference in how a function and kernel called. The way kernel is called(launched) is func_name<<<>>>(func_param1, func_param2, ...). The <<<>>> symbol is launch configuration and it is tell gpu how this kernel will be called. Generally there are 4 parameters that can be passed to gpu in between the symbol.

  1. GridConfig: It’s the number of Blocks to be launched.
  2. BlockConfig: It’s the number of Threads to be launched in a single Block.
  3. Smem(Optional): It’s the address of shared memory to be used.
  4. Stream(Optional): It’s the address of stream in which the kernel will be launched.

For example, <<<10, 34>>> will create 10 Blocks with each block having 34 threads, i.e., (10 x 34) threads in total. This launch will run 340 instances of same kernel in parallel. These launch parameters are read by cuda in dim3(Vector3) fashion, so, the above example is read as (<<< (10, 1, 1), (34, 1, 1)>>>*. In function one can access these threads from inbuilt variables. For ex., threadIdx.x will give current thread index in x direction. Similarly there are blockIdx, blockDim, gridDim.
One is allowed to run 1024 threads in a BLOCK, 65536 blocks in a 1D GRID.
In cuda the basic unit for computation is a thread. Kernel launch create hierarchical groups of threads. Threads are grouped into Blocks, and Blocks into Grids. All threads execute the same  kernel code, but can take different paths.

Launch Parameters

  • Generally every kernel run in a thread.
  • CUDA has an architecture to manage these threads in BLOCKs, GRIDs, STREAMs.
  • One is allowed to run 1024 threads in a BLOCK, 65536, 65536 blocks in a 2D GRID and 1 kernel in a STREAM (16 streams in total).
  • These Threads, blocks and grids is something one passes as launch parameters as mentioned above.
  • For example, func_name<<<10, 34>>> will create 10 Blocks with each block having 34 threads, i.e., (10 x 34) threads count. This launch will run 340 instances of same kernel in parallel.
  • These launch parameters are read by cuda in dim3(Vector3) fashion, so, the above example is read as <<<(10, 1, 1), (34, 1, 1)>>>.
  • In function one can access these threads from inbuilt variables. For ex., threadIdx.x will give current thread index in x direction. Similarly there are blockIdx, blockDim, gridDim.

Memory Access

  • Local (On-Chip): Thread scope, managed by registers.
  • Shared/L1 (On-Chip): Block scope, shared between threads of a block.
  • Global/L2 (Off-Chip): Grid scope, shared between all threads & CPU copies.

Launch Process

  • The kernel launch from host is by default asynchronous in gpu. i.e., all instances of kernel are in parallel execution mode.
  • In gpu these kernels are given to SMs(28).
  • Each SM gets 32 blocks to process and other blocks have to wait till the previous ones are done.
  • In SM, the warp scheduler chooses a warp(group of 32 thread) from among all the warps in all the resident threadblocks on SM and the warp gets executed in lock-step execution mode in cores(128) on SM. i.e., 128/32 = 4 warps at a time simultaneously.
  • If there are only few threads per block then SM can have two or more blocks happening on it to utilize all cores.

The Problem Statement

The problem chosen is a clustering problem where entities need to be grouped based on the other entities that they're similar to.

Given a file of several million rows each having two columns, first contains “Id”(1 integer) and second contains “related Id’s”(1000 integers). Format of file is shown below:

1{2, 5, 6, 10, 11}
2{5, 10, 11, 15}
3{6, 10, 11, 17}
4{3, 10, 12, 17}

To do:

For each row in id, we have to generate a new column which has co_related_ids with a threshold as a param Say 2 for the current case
Logic: For id #1, Id # 2 & 3 are co_related as 2 & 3 has more than 2 related_ids in common with id #1
For #2 5,10 & 11
For #3, 6,10 & 11
For #4, Only 10 so not co_related_ids

Output should be as below

1{2, 3}
2{1, 3}
3{1, 2}

The Solutions

Algorithm: To test the gpu computation power the brute force implementation is chosen to solve the problem. Following are the series of experiments and learning from them to do next step. Each experiment has three parts WHY, HOW and ANALYSIS.


The initial thought was that since we cant load all 14 million rows into memory and do processing, we thought of doing it in chunks.
The idea of this experiment is to load number of rows as a chunk and do the comparison for one single row with first chunk and go till the end of file as chunks.

This experiment uses two loops to stream file data to CPU memory. One loops load the row(at array index 0)  to be compared with all other rows and second loops loads number of rows(at array indices from 1 to (n-1)) from same file to do the comparison.
The cuda kernel launch config is <<< chunk_size, 1>>>(rows[][], result[][])  and it has 2 loops to compare each related_id of one row with all other related_id's to find if there are at least two matching related_ids exist between the two rows. if match found then result[chunk_row_index][0] = 1.

#define THRESHOLD 2
#define RELATED_IDS 1000

__global__ void kern_2D(int **rows, int **res) {
  int row1Idx = blockIdx.x;

  res[row1Idx][0] = 0;
  int th = 0;
  int ff[2];

  // loop
  if (row1Idx > 0) {
    for (int i = 1; i < 1001; i++) {
      int val1 = rows[0][i];
      if (val1 == 0)
      //printf("trying for -> %d \n", val1);
      for (int k = 1; k < 1001; k++)
        int val2 = rows[row1Idx][k];
        if (val1 == val2) {
          ff[th] = val1;
          if (th >= THRESHOLD) {
            //printf("(%d = %d)found -> %d || %d \n", rows[row1Idx][0], row1Idx, ff[0], ff[1]);
            res[row1Idx][0] = 1;

// main function
int main()
  //declare pointers for rows and result
  int **rows, **res

  //load rows from file
  //declare (gpu)device pointers for rows and result
  int **d_rows, **d_res;
  //copy data from cpu to gpu 
  copyToGPU(rows, d_rows);
  //launch kernel (here THREADS represents no. of blocks, 1 is no. of threads in each block)  
  kern_2D<<<THREADS, 1 >>>(d_rows, d_res);
  //copy data back to cpu
  copyToCPU(d_res, res);


50 rows || GridSize(50,1,1), BlockSize(1,1,1) ->  6.65 seconds (chunk size 50) ==> 50 kernel calls. (time increases as no of chunk increases).

Similarly for 1000 rows || GridSize(1000,1,1), BlockSize(1,1,1) -> 315 seconds(chunk size 1000) ==> 1000 kernel calls.


This experiment aim to launch the kernel with 2D grid of blocks to process large amount of chunk of file and 1000 threads in a block to get rid of one loop in kernel.

kernel launch:

config is <<< dimGrid(blocks, blocks, 1), dimBlock(threads, 1, 1)>>>.

dimGrid and dimBlock are dim3(3D vector) to launch blocks and threads in x,y,z direction respectively.

#define LEN 1001
#defien THREADS 1000
#deine BLOCKS 1000 // no. of rows to be processed at a time.

__global__ void kern_2D(int **rows, int **res) {

  int row1Idx = (blockDim.x * blockIdx.x) / THREADS;
  int row2Idx = (blockDim.y * blockIdx.y);
  int cIndex = threadIdx.x; //  add 1 because we don’t want to compare index 0 which is a row id

  // loop
  if (row1Idx != row2Idx )
    //printf(" thread [%d, %d] = %d\n", row1Idx, row2Idx, threadIdx.x);
      int val1 = rows[row1Idx][cIndex + 1];
      if (val1 != 0) {
        for (int k = 1; k < LEN; k++)
          int val2 = rows[row2Idx][k];
          if (val1 == val2) {
            res[row1Idx][row2Idx] += 1;
            //printf("(!!!!!!!!!!!! %d [%d] found in %d [%d] -> %d #incr = %d\n", rows[row1Idx][0], row1Idx, rows[row2Idx][0], row2Idx, val1, res[row1Idx][row2Idx]);

int main()
  dim3 dimGrid(BLOCKS, BLOCKS, 1);

  dim3 dimBlock(THREADS, 1, 1);

  kern_2D << <dimGrid, dimBlock >> >(d_rows, d_res);


1k rows || dimGrid(1k, 1k, 1), dimBlock(1k, 1, 1) -> 3.6 seconds

Maximum blocks in x and y direction are 65335. i.e 65k rows in one kernel.

2k rows || dimGrid(2k, 2k, 1), dimBloc(1k, 1, 1) -> 14 seconds


This improves the loop efficiency in kernel.

In kernel:

row and column grid, instead of doing row1Idx != row2Idx, do row1Idx < row2Idx. i.e., doing only row1 with row2 And not not doing vise versa because results are same for both.

#pragma unroll on loop makes loop faster because of the enhancement of Instruction-Level Parallelism (ILP).

  if (row1Idx < row2Idx )
  #pragma unroll
  for (int k = 1; k < LEN; k++)


1k rows || dimGrid(1k, 1k, 1), dimBlock(1k, 1, 1) -> 1.7 seconds

4k rows || dimGrid(4k, 4k, 1), dimBlock(1k, 1, 1) -> 25 seconds

pragma unroll reduced the time by 20%

kernel function -> if(row2Idx < row2Idx) reduced the time by half.

For small size data set the result were correct and matching for all kernel launch. But It was observed that in case of big chunk as 1000 records, the results for experiment 14 were varying from one launch to another.


This was done to resolve the varying results in different kernel lunch happening in experiment 14. Its looked like the threads which were trying to updates the same record was not successful all the time. i.e., sometime it was failing reason being all threads running in parallel try to update same record in res[i][j] kernel.

So fix above issue:

Shared memory is used to update records for current thread in respective index of shared variable.

Then __syncthreads() is used to put stop to wait till all threads have written in shared memory and then used combined result to update the record(This fixed the issue).

Shared memory was used since 1000 related Ids are doing their comparison in their own thread, synchronization can be done since they are running in same block and are updating res[i][j] based on that blockIdx x, y.

  if (row1Idx < row2Idx)
    __shared__ int th[THREADS];
    th[cIndex] = 0;
    //printf(" thread [%d, %d] = %d\n", row1Idx, row2Idx, threadIdx.x);
    int v = rows[row1Idx][cIndex + 1];
    if(v != 0)
    #pragma unroll
    for (int k = 1; k < LEN; k++)
      if (v == rows[row2Idx][k])
        th[cIndex] = 1;

    // wait for all threads in same block to finish their comparison.
    int matches = 0;
    for (size_t i = 0; i < THREADS; i++)
      matches += th[i];
    res[row1Idx][row2Idx] = matches;
    res[row2Idx][row1Idx] = matches;

The atomic operation was also tried to update the record(This also fixed the issue). This is useful when threads in different blocks are trying to update the same record, but atomic operation adds more latency than shared memory since these operations are queued one after another to guarantee the update. This was achieved using atomicAdd(&res[i][j], 1) instead of res[i][j] += 1 increment.

  if (row1Idx < row2Idx)
    int v = rows[row1Idx][cIndex + 1];
    if(v != 0)
    #pragma unroll
    for (int k = 1; k < LEN; k++)
      if (v == rows[row2Idx][k])
        atomicAdd(&res[row1Idx][row2Idx], 1);
        atomicAdd(&res[row2Idx][row1Idx], 1);


The results were matching every time and correct.

Time taken is almost similar even with shared memory and atomic operation.

1k rows || dimGrid(1k, 1k, 1), dimBlock(1k, 1, 1) -> 1.5 seconds


This aim to try Coalesced memory access to see the kernel efficiency. Basically converted kernel to take args in 1D array instead of 2D. The coalesced memory access works is when there is memory access in kernel, the thread wrap does all the access together and It is faster if the pattern of index is 0, 1, 2, 3... consecutive instead of 0, 4, 1, 5.... unordered ref 1.

  int v = rows[row1Idx][cIndex * LEN + 1];

  if (v == rows[row2Idx * LEN + k])

  res[(row1Idx)* totalRows + (row2Idx)] = matches;
  res[(row2Idx)* totalRows + (row1Idx)] = matches;


nvcc compilation: didn't affect the time taken.

Make compilation (genecode PTX and SASS) ref 2:

1k rows || dimGrid(1k, 1k, 1), dimBlock(1k, 1, 1) ->

  • Atomic operation: 0.87 seconds
  • Shared memory operation:  0.66 seconds

50k rows || dimGrid(50k, 50k, 1), dimBlock(1k, 1, 1) -> failed.

  • Memory allocation error!
  • It was found that pointer allocation for 1D array, more than (24k * 24k) was failing since heap is not big enough to hold this much.
  • Converted 1D result array into 2D array(Worked) -> 26 minutes.

Equation for nk, time = n^2 x .66 seconds.Img: Exp 18 nvcc


Img: Exp 18 Make(Atomic Operation)


Img: Exp 18 Make(Shared Memory)



Kernel Launching:

While launching kernel, setting block-size(threads in a block) for “sweet spot” number of warps for optimal performance from SM is the priority and one should do some benchmarking/profiling to see where that is. The grid-size(blocks in a grid) is just for total amount of work needed.


Ideal case: The app should avoid any kind of synchronization between threads:

  • i.e., Each threads updates different record of global/L2 memory.
  • This way each thread update attempt is confirmed.
  • This is generally the case a kernel should run.

Non Ideal case: This adds extra latency in kernel execution.

  • When all these kernel launches and there are threads in blocks which updates the same record(L1/L2 memory), the update cannot be guaranteed. i.e., order of update is random and some time some threads will fail to update since they try to update at the same time.
  • This kind of behavior is sometime not visible in small kernel run since there is enough bandwidth to do update call. But if the scale of launch is high(lots of blocks), the results will start varying from one launch to another.

Case 1: When only threads within a block trying to update same record:

  • One synchronize these threads using __syncthreads().
  • One can create for example shared int variable(L1 memory) in kernel and access data from any of these threads.
  • One can put __syncthreads() to put a stop in a kernel to wait till all threads have reached to this step of execution and then resume execution.

Case 2: When threads from different blocks are trying to update the same record:

  • In this case __syncthreads() with shared memory does not help, but atomic operations can be used to update the record.
  • For example, increment of a record by these threads can be done by using atomicAdd(&record, 1). this queue the next atomic update call and guarantees the update.
  • As we can infer this does slows down the execution of kernel since all the atomic calls have to be finished.

Through pipelining:

  • Serial(1 way) -> Async(H2D) - K<<<>>> - Async(D2H)


  • (2 way) -> Async(H2D) - K1 - DH1 || K2 - DH2 || K3 -DH3
  • (3 way) ->  HD1 - K1 - DH1 || HD2 - K2 - DH2 || DH3 - K3 -DH3
  • K = Kernel, HD1 = H2D1, DH1 = D2H1 and || = kernel separator
  • Through streams(non-default streams)ref 3:
  • A stream is basically a queue of device work.
  • Operations within same stream are ordered(FIFO).
  • Manage streams: declare ->  cudaStream_t stream; allocate -> udaStreamCreate(&stream); destroy ->  cudaStreamDestroy(stream);
  • Memory copy -> cudaMemcpyAsync( d_a, h_a, size, H2D, stream);
  • Kernel launch ->  kernel<<< blocks , threads, 0, stream>>>();
  • Memory copy -> cudaMemcpyAsync( h_a, d_a, size, D2H, stream);
  • Important: use Pinned(Page-Locked) host memory instead of pageable(eg. malloc) .
  • i.e., allocate -> cudaMallocHost(&h_a, nBytes); and free up -> cudaFreeHost(h_a);
  • This allows any copy(data transfer) overlap from streams to occurs.


  • To have concurrency is to avoid default stream in the app.
  • It is important to realize that when do we use this type of execution. We know that the kernel launch is by default asynchronous by device. So the concurrency is not about the internal of kernel. i.e., if we launch the 2 kernel in different streams, it does not mean that they will run concurrently.
  • The device takes the first kernel from the first stream launched and takes the blocks and passes them to SM to process. i.e., if there are more number of blocks that SMs can process at a time then other kernels in different streams will not run concurrently.
  • The kernel concurrency is only witnessed when there are kernels running less blocks.
  • The data transfer and kernel concurrency is visible when different streams are used. but the visibility depends on the transfer time and kernel execution time. i.e., if there is huge gap in time taken by them, then the concurrency is not visible even though there is concurrency.


It was observed that SM processes threads in warps and it can occupancy of kernel is maximum if threads in a block is multiple of 32. A condition if without else doesn't affect the kernel execution very much. A direct write operation in a global memory by more than a single thread should not happen since it can lead to varying results. Threads updating one single global memory should be handled by Shared Memory if these threads belongs to same block or Atomic Operation if these threads belongs to different blocks. In terms of kernel concurrency, It is relative if more than two different kernels are launched and their theoretical occupancy is not 100%. And also it fades away as the number of thread block increases since SM can only process a certain amount of threads at a time. make having genecode configuration to create PTX(Parallel thread execution) and SASS(Executable object code) compilation made coalesced access faster than the normal nvcc compilation.