Lab 8 - Introduction to GPU Programming

Getting Started

In this lab we will mostly be getting used to the workflow of using GPUs. The setup is not quite as easy as last time, but it should still be quick. Use Lmod and pip for the following steps. If you call plain pip, it might use the Python 2.7 version that is on the system by default; you can bypass this by using python -m pip install <package> instead. Store all of the results in a folder called lab8, as in the last lab.


  • Get python/3.9.12
  • Get cuda/11.4.2
  • Get numba (pip)
  • Get cuda-python (pip). If you run into trouble with the installation, make sure the CUDA_HOME environment variable is set. Hint: you can deduce this from whereis nvcc if you have cuda loaded. The path should point to the folder that holds another folder called lib, among others. You don't have to do this when installing on a GPU node.
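The CUDA_HOME hint can be sketched in Python. This is a minimal sketch, assuming the usual toolkit layout in which nvcc sits at <CUDA_HOME>/bin/nvcc; the helper name cuda_home_from_nvcc is hypothetical, and the layout on your cluster may differ.

```python
import shutil
from pathlib import Path


def cuda_home_from_nvcc(nvcc_path):
    """Map <CUDA_HOME>/bin/nvcc to <CUDA_HOME> (assumed layout)."""
    return Path(nvcc_path).parent.parent


# shutil.which returns None when nvcc is not on PATH,
# for example when the cuda module has not been loaded.
nvcc = shutil.which("nvcc")
if nvcc is None:
    print("nvcc not found; load the cuda module first")
else:
    home = cuda_home_from_nvcc(nvcc)
    # The lab text says this folder should contain a folder called lib.
    has_lib = any((home / name).is_dir() for name in ("lib", "lib64"))
    print(f"export CUDA_HOME={home}")
    print("lib folder present:", has_lib)
```

If the printed path looks right, you can paste the export line into your shell before running pip.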


The following shouldn't produce any errors:
python -c "import numba"

Finding GPU information

When working with GPUs, you will often need to find information about them. This can be done either by starting an interactive job and checking there, or by attaching to a running GPU job interactively. We will be doing both in this lab, but let's try the former first. You can find everything you need for this in the GPU section of the documentation.


Run an interactive job on the GPU partition with 1 GPU and find the CUDA and driver versions that the GPU has. Copy the command you ran and the whole output of the command, and save them into a file called NVIDIA.txt.
Hint: there is a command for this nvidia-...

This command is useful not only for solving software compatibility issues, but also to analyze the usage of the GPU. We will see how to do this later.

CUDA code

Vector Addition

Now, let's look at some GPU code. Below is a simple Python vector addition program that is implemented on the GPU. Most of the code is done; you just need to understand it and fill in the #TODO sections as instructed. All of the information you need is in the lecture slides. If you don't want to wait for the Slurm scheduler every

from numba import cuda
import numpy as np

@cuda.jit
def f(a, b, c):
    # Perform the addition. Make sure we do not go out of bounds: use the size of the array and the index of the thread in the grid
    # TODO

N = 100000
a = cuda.to_device(np.random.random(N)) # Creating an array of N random numbers, and sending them to the GPU
b = cuda.to_device(np.random.random(N))
c = cuda.device_array_like(a)           # Creating an array in the GPU memory of the same structure as array a

threads = 256
blocks = (len(a) // threads) + 1 # Making enough blocks so that every element gets a thread
f[blocks, threads](a, b, c) 
print(c) # The result we want to see is stored in the variable c, but the output is strange. 
result = # Look at the lecture materials to see what the problem is and how we could understand and see the result.


  • Figure out what should be done with the result variable so that we can see the resulting array.
  • Run the code in the GPU partition with 1 GPU.
  • If you have found the function to use in result, change the name of the file to
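The launch arithmetic from the listing above can be checked on the CPU before submitting a job. This is a minimal sketch in plain Python, not GPU code; the helper names launch_config and ceil_div_blocks are hypothetical. It shows that the launch creates more threads than there are elements, which is why the kernel needs a bounds check.

```python
def launch_config(n, threads_per_block=256):
    """Block count as computed in the lab code: floor division plus one."""
    blocks = n // threads_per_block + 1
    return blocks, threads_per_block


def ceil_div_blocks(n, threads_per_block=256):
    """Alternative: ceiling division, no spare block when n divides evenly."""
    return (n + threads_per_block - 1) // threads_per_block


N = 100000
blocks, threads = launch_config(N)
total_threads = blocks * threads
surplus = total_threads - N  # threads whose global index is >= N

print(blocks, total_threads, surplus)  # 391 100096 96
```

With N = 100000 the two formulas agree, but for a size that is an exact multiple of the block size (for example 512) the lab formula launches one extra, fully idle block.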

Heat Dissipation

Next we are going to look at some code that solves Laplace's equation in one dimension for heat propagation through time (understanding this is not really important here) and that uses some nice techniques. The program essentially holds an array of floats, which you can interpret as temperatures, and every loop cycle each element moves some of its value to nearby elements. There are just a few

import numpy as np
from numba import cuda

# Use an odd problem size.
# This is so there can be an element truly in the "middle" for symmetry.
size = 10001
data = np.zeros(size)

# Middle element is made very hot
data[5000] = 100000
buf_0 = cuda.to_device(data)

# This extra array is used for synchronization purposes
buf_1 = cuda.device_array_like(buf_0)
niter = 1000

@cuda.jit
def solve_heat_equation(buf_0, buf_1, timesteps, k):
    i = # TODO find the absolute index of the thread

    # TODO Don't continue if our index is outside the domain

    # Prepare to do a grid-wide synchronization later
    grid = # TODO

    for step in range(timesteps):
        # Select the buffer from the previous timestep
        if step % 2 == 0:
            data = buf_0
            next_data = buf_1
        else:
            data = buf_1
            next_data = buf_0
        # Get the current temperature associated with this point
        curr_temp = data[i]

        # Apply formula from finite difference equation
        if i == 0:
            # Left wall is held at T = 0
            next_temp = curr_temp + k * (data[i + 1] - (2 * curr_temp))
        elif i == len(data) - 1:
            # Right wall is held at T = 0
            next_temp = curr_temp + k * (data[i - 1] - (2 * curr_temp))
        else:
            # Interior points are a weighted average of their neighbors
            next_temp = curr_temp + k * (
                data[i - 1] - (2 * curr_temp) + data[i + 1]
            )

        # Write new value to the next buffer
        next_data[i] = next_temp

        # Wait for every thread to write before moving on
        # TODO function that synchronizes all threads

# This is the forall call to the GPU function that automatically constructs the kernel.
solve_heat_equation.forall(len(data))(
    buf_0, buf_1, niter, 0.25
)

#TODO print the final data


  • There will be some missing functions in the code marked TODO. Fill those.
  • What is the point of the synchronization and the use of 2 buffers? Hint: it's a common problem in computer science whose name is made of 2 words.
  • Look at where the GPU function is called. Currently it uses a forall loop. This way the code produces a warning. Run the code and put the warning it produces into a file called WARNING.
  • Define threads and blocks manually and call the GPU function using those in a way that no error is produced.
  • Change the name of the code file to be Leave it so that it has the GPU function call with manually defined kernels.
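To check the numbers your kernel prints, a small CPU reference of the same update rule can help. This is a minimal sketch in plain Python, assuming the same boundary treatment as the kernel above (both walls held at T = 0); the names heat_step_reference and solve_reference are hypothetical, and the tiny problem size is chosen only so the values are easy to verify by hand.

```python
def heat_step_reference(data, k):
    """One timestep: read only from `data`, write into a fresh list."""
    n = len(data)
    next_data = [0.0] * n
    for i in range(n):
        left = data[i - 1] if i > 0 else 0.0       # left wall held at T = 0
        right = data[i + 1] if i < n - 1 else 0.0  # right wall held at T = 0
        next_data[i] = data[i] + k * (left - 2 * data[i] + right)
    return next_data


def solve_reference(data, timesteps, k):
    """Apply the update rule `timesteps` times and return the final list."""
    for _ in range(timesteps):
        data = heat_step_reference(data, k)
    return data


initial = [0.0] * 11
initial[5] = 100.0  # hot middle element, as in the lab code
after_one = solve_reference(initial, 1, 0.25)
after_two = solve_reference(initial, 2, 0.25)
print(after_one[4:7])  # [25.0, 50.0, 25.0]
print(after_two[3:8])  # [6.25, 25.0, 37.5, 25.0, 6.25]
```

Running the GPU version with the same size, number of timesteps and k should give matching values, up to floating-point rounding.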


NVIDIA profiling tools are available by default on the cluster's GPU nodes. You can download the UI tools for free from the NVIDIA website. Your task is to construct the kernels in such a way that the GPU utilization is over 15%. We will use the compute tool (ncu) here, because we are trying to analyze just 1 kernel. If we wanted to look at a PyTorch program or something else on a bigger scale, it would make sense to try the NVIDIA systems tool first. If you ran your code earlier with the forall kernel execution method, you probably saw a message saying the GPU is underutilized. You can also change the problem size variable if needed.


  • Construct the kernels and problem size in such a way that the Compute Throughput shows over 15%.
  • Save the output from ncu analysis to a file called NCU.txt. The output should be human readable.

ML using GPUs


Not all of you will be doing machine learning on our cluster, and every piece of software is somewhat different, but let's use this as a general example. Look at our cluster's documentation and find where the AI examples are located. Follow one of the guides and run it on the GPU partition with either 2 nodes and 1 GPU per node, or 1 node and 2 GPUs. Use the native installations for now; we will do the container ones later. While the job is running, attach to it (see interactive jobs in the docs) and run the same command to inspect the GPUs that you used in Exercise 2. By doing this we can verify that our code is utilizing the GPUs. You should see a process running on the GPU(s). Save this output into a file called ANSWER.


  • Follow the AI examples guide for either PyTorch or TensorFlow and get it to run natively
  • Attach to the running job
  • Run the command to inspect the GPU

With containers

In the previous labs you got to know containers, and just now you briefly familiarized yourself with distributed ML code. Next you will go through one of the examples again, this time using the container option for it. Containers are not a solution for everything and come with their own set of problems, but they do offer some utility beyond being portable and reproducible. For example, if you're using containers, you can use newer GPU driver versions than are available on LUMI. This is possible because some dependencies that would otherwise exist might not be there when working in a container environment. Thus, being able to use containers can be a valuable skill. If a container already exists, you can also avoid compiling scientific software, which most of the time is a pain.


  • Follow the AI examples guide for either PyTorch or TensorFlow and get it to run in a container
  • Attach to the running job
  • Run the command to inspect the GPU
  • Find out which flag enables the use of NVIDIA GPUs within a singularity container. Put the answer into a text file called FLAG.txt