
Lab 5 - HPC Center services

Overview

Jupyter

What is Jupyter? How to gain access?

jupyter.hpc.ut.ee is a website that provides an online workspace called JupyterLab to many users at once. Behind the scenes, a system called JupyterHub creates, manages, and connects these individual workspaces, which we call single-user servers. The platform has a user-friendly interface, allowing you to open your own single-user server inside the powerful HPC cluster managed by Slurm. Importantly, JupyterHub supports both the Python and R programming languages.

If you're a UT user, you gain access to JupyterHub together with your access to the HPC cluster. ETAIS users, however, follow a slightly different procedure: to get access to Jupyter, send a request to support@hpc.ut.ee, and the support team will provide you with a personalized solution. Students will get their password in their $HOME directory on the cluster.

Jupyter architecture: what is a single-user server?

JupyterHub plays a key role in efficiently handling the resources that Slurm provides for each JupyterLab user on the HPC cluster. Several essential components work together to authenticate users, create individual JupyterLab environments (single-user servers), and establish connections between users and JupyterLab.

To help you visualize this process, here is a high-level overview of how the different Jupyter components interact.

A crucial part of this system is the "Spawner." This component is responsible for launching each unique JupyterLab server. Think of the single-user server as your own personal instance of the JupyterLab application.

Jupyter Profiles

Jupyter profiles are a way to select a predefined resource allocation for your JupyterLab. Currently, there are five available options:

Profile        CPU cores   Memory   Time limit   Notes
Default        1           8 GB     6 h
Large memory   2           32 GB    24 h
Multicore      8           16 GB    24 h
GPU            2           40 GB    6 h          1 GPU
Testing        1           8 GB     2 h

Complete

  1. Begin by accessing jupyter.hpc.ut.ee and starting a server with the Testing profile. The password is located in a file with the extension jupyter.pass in your $HOME directory.
  2. Create a new directory lab5 in the hpc-course/<etais_user> directory. Hint: use the Jupyter Terminal for that.
  3. Create a new notebook using Python.
  4. In the notebook, write a short Python script that outputs Hello, JupyterLab!.
  5. Don't forget to save your notebook with the filename jupyter_lab5.ipynb. Unfortunately, Jupyter notebooks cannot be saved anywhere other than the user's $HOME. However, you can use the Terminal to navigate the cluster file system.
  6. Use the Terminal to copy jupyter_lab5.ipynb to the HPC course directory (see the sketch after this list).
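
The Terminal steps above can be combined as below. This is a minimal sketch: the hpc-course directory is assumed to live under $HOME (adjust the path to wherever you created it in earlier labs), and your_etais_user is a placeholder for your own username.

ETAIS_USER=your_etais_user                                 # placeholder: your ETAIS username
mkdir -p ~/hpc-course/"$ETAIS_USER"/lab5                   # step 2: create the lab5 directory
cp ~/jupyter_lab5.ipynb ~/hpc-course/"$ETAIS_USER"/lab5/   # step 6: copy the saved notebook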

Creating Custom Python Environments

If you have your own Conda or Python environment that you'd like to use within JupyterLab, you can make it available by creating a custom IPython kernel and tailoring it to your needs.

Complete

  1. Set up a Conda or Python environment in your $HOME directory on the cluster by following these steps (a consolidated command sketch follows this list):

    • Begin by loading the any/python/3.9.9 module.
    • Create a Python environment named lab5env by executing the command: virtualenv lab5env.
    • Unload the module to ensure you're using the newly created Python environment.
    • Activate your freshly installed Python environment using the command: source ~/lab5env/bin/activate.
  2. Next, install the ipykernel package using pip; the command in the next step connects the installed ipykernel with your environment.

  3. Execute the following command: python -m ipykernel install --user --name=lab5_kernel. At the end of the long output, the installation prints an Installed kernelspec lab5_kernel in <path> line.

  4. Copy the kernel.json file from the installed kernel path to the HPC course directory.

  5. Install the scrapy package into your environment using pip.

  6. Open Jupyter, launch your server, load the kernel, and confirm that scrapy has been successfully installed by running import scrapy. If the import produces no output, it worked. You can also run the same command under the default Python kernel, which should produce the error No module named "scrapy".
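
For reference, here are steps 1-5 consolidated into shell commands. This is a sketch, not the authoritative procedure: the use of module purge to unload the Python module is an assumption (any equivalent unload command works), and the kernelspec path in step 4 is whatever the install command prints.

module load any/python/3.9.9       # step 1: load the Python module
virtualenv ~/lab5env               # step 1: create the lab5env environment in $HOME
module purge                       # assumption: unload modules so lab5env's python is used
source ~/lab5env/bin/activate      # step 1: activate the environment
pip install ipykernel              # step 2: install ipykernel into the environment
python -m ipykernel install --user --name=lab5_kernel   # step 3: register the kernel
# step 4: copy kernel.json from the <path> printed by the previous command
pip install scrapy                 # step 5: install scrapy into the environment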

Note

If you are an AI enthusiast looking to utilize GPUs with packages like PyTorch or TensorFlow, additional steps are necessary to set up the CUDA driver path for the kernel. You can find detailed instructions in the Jupyter User Manual.
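
The manual's instructions are authoritative; as a hedged illustration only, the kernelspec is a common place to point the kernel at the CUDA libraries. The path below is a placeholder, not the cluster's actual CUDA location.

# Find where the lab5_kernel kernelspec (and its kernel.json) is installed:
jupyter kernelspec list
# Per the manual, kernel.json can then be extended with an "env" entry
# along the lines of:
#   "env": { "LD_LIBRARY_PATH": "/path/to/cuda/lib64" }
# so PyTorch or TensorFlow inside the kernel can locate the CUDA libraries.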

RStudio

What is RStudio and How to Gain Access?

Access RStudio at https://rstudio.hpc.ut.ee. RStudio serves as a compact development environment, offering a user-friendly interface for coding and light testing. It closely resembles the RStudio environment you may typically use on your laptop.

RStudio runs within a virtual machine (VM) equipped with the following hardware specifications:

  • 10 CPU cores
  • 19 GB of RAM

Common Errors and Troubleshooting

The most common issue is running code that consumes all available memory, causing RStudio to freeze. Another familiar problem is the user state directory at ~/.rstudio filling up: in our experience, once this directory approaches 5 GB in size, RStudio may become sluggish or even freeze during login. If you plan to use RStudio for your studies or research, keep in mind that in both cases the only solution is to contact us by email at support@hpc.ut.ee.
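
If you suspect the state directory is the problem, a quick way to check its size from any cluster shell (assuming it shares your $HOME):

du -sh ~/.rstudio    # print the total size of the RStudio user state directory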

Running R Code on the HPC Cluster using Slurm Jobs

To run R code outside the development environment, we encourage you to harness the computational power of the HPC cluster through Slurm jobs. This is particularly useful when dealing with computationally intensive R scripts or analyses that require significant resources.

Here's a step-by-step guide to running your R code using Slurm jobs:

  1. Create a Slurm Script: Start by creating a Slurm script (for example, my_r_job.slurm) that outlines the resources your job needs, such as the number of CPUs, memory, and estimated runtime. Specify the R script you want to run within the script as well.

  2. Submit the Job: Use the sbatch command to submit your Slurm script to the cluster. For instance: sbatch my_r_job.slurm

  3. Monitor Job Progress: After submission, you can monitor the status of your job using squeue -u your_username. This will show you the queue status, job ID, and other relevant information.

  4. Retrieve Results: Once the job completes, the results (output/error logs in slurm-*******.out) will be generated. You can then access these files to analyze the outcomes of your R script.
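
In shell form, the submit-monitor-retrieve cycle from steps 2-4 looks like this (your_username is a placeholder, and slurm-*.out matches the log naming above):

sbatch my_r_job.slurm      # step 2: submit; Slurm prints the assigned job ID
squeue -u your_username    # step 3: check the job's queue status
cat slurm-*.out            # step 4: inspect the output log once the job completes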

A sample Slurm script (my_r_job.slurm) to run an R script named my_analysis.R might look like this:

#!/bin/bash
#SBATCH --job-name=my_r_job        # job name shown in the queue
#SBATCH --ntasks=1                 # a single task
#SBATCH --cpus-per-task=1          # one CPU core
#SBATCH --mem=1G                   # 1 GB of memory
#SBATCH --time=2:00:00             # two-hour time limit

module load any/R/4.0.3            # load the R module
Rscript my_analysis.R              # run the R script

Remember to customize the Slurm script according to your specific requirements.

Using Slurm jobs not only optimizes the utilization of cluster resources but also allows you to run multiple tasks in parallel, saving you valuable time during your data analysis or simulations.


R tasks

RStudio has two primary limitations: memory and CPU. The concept of memory is quite intuitive; the more memory available, the larger datasets your code can handle and manipulate. However, as we aim to maintain RStudio stability for all students, we won't provide examples that could potentially crash or disrupt the RStudio environment.

Instead, let's compare the computational performance of basic mathematical functions between RStudio and running a job on the cluster. To facilitate this comparison, we'll utilize the R library 'bench' for benchmarking. This package employs a high-precision timer, allowing us to compare the execution time of operations that are nearly instantaneous.

library("bench") # loads bench library

x <- runif(100) # creates input vector
lb <- bench::mark(
    sqrt(x),
    x ^ 0.5,
    x ^ (1/2),
    exp(log(x) / 2)
)
lb[c("expression","total_time")] # prints only two essential columns for our test

Complete

Your task is to observe the difference in computation time between RStudio and a Slurm job.

  1. Due to technical issues, ETAIS users are unable to log in to RStudio, so this task is optional. If you are interested in the performance difference anyway, here are the results from RStudio:

    # A tibble: 4 × 2
      expression    total_time
      <bch:expr>      <bch:tm>
    1 sqrt(x)           79.4ms
    2 x^0.5            193.4ms
    3 x^(1/2)          226.4ms
    4 exp(log(x)/2)    161.7ms
    

  2. Utilize knowledge from previous labs and the guide mentioned in this link to script an sbatch job that runs the same benchmark (see the sketch after this list).

  3. Compare the results from both RStudio and the Slurm job.

  4. Save the total time of the best-performing approach in a file named r_benchmarking within your HPC course directory.
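
For task 2, here is one possible shape of the job script. This is a sketch only: it reuses the resource values and R module from the sample script above, while the job name and time limit are arbitrary choices.

#!/bin/bash
#SBATCH --job-name=r_benchmark     # arbitrary job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G                   # assumed ample for this micro-benchmark
#SBATCH --time=0:10:00             # assumed ample runtime

module load any/R/4.0.3            # same R module as the earlier sample script

# Run the benchmark from the "R tasks" section and print the two columns.
Rscript -e '
library("bench")
x <- runif(100)
lb <- bench::mark(
    sqrt(x),
    x ^ 0.5,
    x ^ (1/2),
    exp(log(x) / 2)
)
print(lb[c("expression", "total_time")])
'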
