Lab 5 - HPC Center services¶

Overview¶

Jupyter
RStudio

Jupyter¶

What is Jupyter? How to gain access?¶

Danger

To log into ondemand.hpc.ut.ee use ETAIS LOGIN as your course account is tied to ETAIS.

Jupyter is a web-based interactive computational environment for creating notebook documents. It is a powerful tool that allows you to visualize your code, run it in blocks and see the results all in the same place. Jupyterlab and Jupyter notebooks allow you to run blocks in R alongside Python and support separating environments via kernels.

If you're a UT user, you get access to Jupyter via ondemand.hpc.ut.ee when you have access to the HPC cluster. Jupyter Notebook only offers a very simple interface. Users can open notebooks, terminals, and text files. Although Jupyter Notebook doesn't have all the functionalities of JupyterLab, it is enough for the purposes of this lab and for the introduction into Jupyter.

Jupyter architecture¶

When starting the Jupyter instance, you are presented with a web page, where you select what you want to run. You have the option of running terminals in either bash or python, viewing and editing files and running a kernel. A kernel is an execution environment, that defines the context that your notebook runs in. The kernels usually refer to a virtual environment, one of which we are going to create in this lab.

As for Jupyter Notebook, it is run within a batch job on the compute node. The user will then be able to launch a Jupyter Notebook Server from the Open Ondemand interface.

Complete

Begin by accessing ondemand.hpc.ut.ee and initiating a Jupyter Notebook server. Set Workdir to /gpfs/space/projects/hpc-course/<your_etais_username> and set 2 cores and 8 GB of memory. Leave the virtual environment path empty.
Create a new directory lab5 in your project working directory. Hint: use Jupyter Terminal for that
Create a new notebook using Python.
In the notebook, compose a straightforward Python script that outputs Hello, JupyterLab!.
Don't forget to save your notebook with the filename jupyter_lab5.ipynb to the lab5 directory.

Creating Custom Python Environments¶

If you have your own Conda or Python environment that you'd like to use within JupyterLab, you're in for an interesting adventure of creating a custom IPython kernel, particularly when it comes to customizing it to fit your needs.

Complete

Set up a Conda or Python environment within your user $HOME directory on the cluster following these steps in the Jupyter Terminal:
- Begin by loading the any/python/3.9.9 module.
- Create a Python environment named lab5env by executing the command: virtualenv lab5env.
- Remove the module to ensure you're using the newly created Python environment.
- Activate your freshly installed Python environment using the command: source lab5env/bin/activate.
Next, install the ipykernel package using pip, and then utilize the provided command to connect this installed ipykernel with your environment.
Execute the following command: python -m ipykernel install --user --name=lab5_kernel. At the end of the long output the installation produces Installed kernelspec lab5_kernel in <path> line.
Copy kernel.json file in the installed kernel path to the HPC course directory and lab5 folder.
Install the scrapy package into your environment using pip.
Open Jupyter, launch your server, load the kernel, and confirm that scrapy has been successfully installed by running import scrapy. If import function, doesn't produce output that means it worked. Also, you can run the same command under default Python kernel that should produce an error No module named "scrapy"

Note

If you are an AI enthusiast looking to utilize GPUs with packages like PyTorch or TensorFlow, additional steps are necessary to set up the CUDA driver path for the kernel. You can find detailed instructions in the Jupyter User Manual.

RStudio¶

What is RStudio and How to Gain Access?¶

Danger

To log into ondemand.hpc.ut.ee use ETAIS LOGIN as your course account is tied to ETAIS.

Access RStudio at ondemand.hpc.ut.ee. Start a RStudio app. RStudio serves as a compact development environment, offering a user-friendly interface for coding and light testing. It closely resembles the RStudio environment you may typically use on your laptop.

Common Errors and Troubleshooting¶

A problem that often affects RStudio users is the user state directory located at ~/.rstudio becoming filled. In our experience, if this directory approaches 5 GB in size, RStudio may become sluggish or even freeze during login. If you plan to use RStudio for your studies or research, please keep in mind that the only solution in both cases is to contact us by writing an email to support@hpc.ut.ee.

Running R Code on the HPC Cluster using Slurm Jobs¶

To run R code outside development environment, we encourage you to use the computational power of the HPC cluster, Slurm jobs can be employed. This is particularly useful when dealing with computationally intensive R scripts or analyses that require significant resources.

Here's a step-by-step guide to running your R code using Slurm jobs:

Create a Slurm Script: Start by creating a Slurm script (for example, my_r_job.slurm) that outlines the resources your job needs, such as the number of CPUs, memory, and estimated runtime. Specify the R script you want to run within the script as well.
Submit the Job: Use the sbatch command to submit your Slurm script to the cluster. For instance: sbatch my_r_job.slurm
Monitor Job Progress: After submission, you can monitor the status of your job using squeue -u your_username. This will show you the queue status, job ID, and other relevant information.
Retrieve Results: Once the job completes, the results (output/error logs in slurm-*******.out) will be generated. You can then access these files to analyze the outcomes of your R script.

A sample Slurm script (my_r_job.slurm) to run an R script named my_analysis.R might look like this:

#!/bin/bash
#SBATCH --job-name=my_r_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=2:00:00

module load r/4.1.3
module load r-bench/1.1.3
module load r-pillar/1.9.0
Rscript my_analysis.R

Remember to customize the Slurm script according to your specific requirements.

Using Slurm jobs not only optimizes the utilization of cluster resources but also allows you to run multiple tasks in parallel, saving you valuable time during your data analysis or simulations.

Please tailor this script according to your specific requirements. This approach not only optimizes resource utilization by the job but also allows you to run multiple tasks in parallel, saving you valuable time during data analysis or simulations.

R tasks¶

Let's compare the computational performance of basic mathematical functions between RStudio and running a job on the cluster. To facilitate this comparison, we'll utilize the R library 'bench' for benchmarking. This package employs a high-precision timer, allowing us to compare the execution time of operations that are nearly instantaneous.

Here is the R code for running example calculations and benchmarking their runtime.

library("bench") # loads bench library

x <- runif(100) # creates input vector
lb <- bench::mark(
    sqrt(x),
    x ^ 0.5,
    x ^ (1/2),
    exp(log(x) / 2)
)
lb[c("expression","total_time")] # prints only two essential columns for our test

Complete

Your task is to observe difference in computational time between RStudio and a Slurm job.

Log in to OpenOndemand at ondemand.hpc.ut.ee and start a RStudio server with 2 cores and 8 GB of memory.
Run the calculations in the GUI. When you run the same analysis in RStudio you need to install the bench package for your RStudio server instance beforehand. Use the command install.packages("bench").
Utilize knowledge from previous labs and the guide mentioned in this link to script an sbatch job for running the same benchmark. When running a Slurm job on the cluster save the R code as my_analysis.R. The R bench package will be loaded in the Slurm job with the command module load r-bench/1.1.3.
Compare the results from both RStudio and the Slurm job.
Save the total time of the best-performing approach in a file named r_benchmarking within your HPC course project directory in lab5. Use a variable total_time to define the time value in the file.

Lab 5 - HPC Center services¶

Overview¶

Jupyter¶

What is Jupyter? How to gain access?¶

Jupyter architecture¶

Creating Custom Python Environments¶

RStudio¶

What is RStudio and How to Gain Access?¶

Common Errors and Troubleshooting¶

Running R Code on the HPC Cluster using Slurm Jobs¶

R tasks¶

References¶