
TensorFlow on Rocket

Introduction

TensorFlow is an end-to-end open source platform for machine learning.

Installation

Users can install TensorFlow themselves by following the official pip instructions (https://www.tensorflow.org/install/pip). In short:

module load python/3.9.12
python -m pip install tensorflow

Keep version compatibility in mind: each TensorFlow release supports specific Python, CUDA, and cuDNN versions, so a mismatch is the first thing to check if the installation does not work.
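When chasing a compatibility problem, it helps to first record what the interpreter actually sees. The following is a minimal sketch, not part of the original instructions: the helper name describe_environment is hypothetical, and it only uses tf.__version__ and tf.config.list_physical_devices, returning a partial report instead of failing when TensorFlow is missing or broken.

```python
import importlib.util
import sys


def describe_environment():
    """Report the Python version, TensorFlow version and visible GPUs.

    Hypothetical helper: it never raises, so it can also be used to
    diagnose a missing or broken TensorFlow installation.
    """
    info = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "tensorflow": None,  # stays None if TensorFlow is not installed
        "gpus": [],
    }
    if importlib.util.find_spec("tensorflow") is None:
        return info
    try:
        import tensorflow as tf

        info["tensorflow"] = tf.__version__
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except Exception as exc:  # e.g. a CUDA/cuDNN library failing to load
        info["error"] = repr(exc)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

An empty GPU list on a GPU node usually points at the driver, CUDA, or cuDNN side rather than at TensorFlow itself.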

Containers

Singularity can pull Docker containers and automatically converts them into the Singularity image format. Here we use NVIDIA's official containers, which include TensorFlow built to run on GPUs. We also set the environment variables $SINGULARITY_TMPDIR and $SINGULARITY_CACHEDIR so that temporary and cached data end up in a known place that can be removed afterwards. The pull writes the image to tensorflow_23.08-tf2-py3.sif in the current directory.

module load singularity
mkdir tmp cache
export SINGULARITY_CACHEDIR=$PWD/cache
export SINGULARITY_TMPDIR=$PWD/tmp
singularity pull docker://nvcr.io/nvidia/tensorflow:23.08-tf2-py3
rm -rf tmp cache

Code examples

TensorFlow can train on multiple GPUs within a single machine using tf.distribute.MirroredStrategy; see the TensorFlow documentation on distributed training for details. To scale across multiple nodes, we would have to look towards MultiWorkerMirroredStrategy instead.

Example code for the single-node, multi-GPU case using MirroredStrategy:

distributed.py

import tensorflow as tf
import keras

def get_compiled_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()]
    )
    return model


def get_dataset():
    batch_size = 32
    num_val_samples = 10000

    # Return the MNIST dataset in the form of a `tf.data.Dataset`.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")

    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size),
    )

def generate_gpu_list(num_gpus):
    # MirroredStrategy hangs inside a container if the job has not been
    # allocated all GPUs on the node, unless the devices are listed explicitly.
    return [f"/GPU:{i}" for i in range(num_gpus)]



# Create a MirroredStrategy.
gpuslen = len(tf.config.list_physical_devices('GPU'))
strategy = tf.distribute.MirroredStrategy(generate_gpu_list(gpuslen))
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = get_compiled_model()
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
model.fit(train_dataset, epochs=2, validation_data=val_dataset)
# Test the model on all available devices.
model.evaluate(test_dataset)
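Under MirroredStrategy, the batch size given to Dataset.batch is the global batch, which is split across the replicas. With batch_size = 32 as in get_dataset() and 8 GPUs, each GPU therefore sees only 4 samples per step. A common adjustment, sketched here as an assumption rather than as part of the original example, is to scale the global batch with the number of replicas. The helper name scaled_batch_size is hypothetical.

```python
def scaled_batch_size(per_replica_batch_size, num_replicas):
    """Return the global batch size that keeps the per-GPU batch constant."""
    if per_replica_batch_size < 1 or num_replicas < 1:
        raise ValueError("both arguments must be positive integers")
    return per_replica_batch_size * num_replicas


if __name__ == "__main__":
    # 8 matches the `--gres gpu:tesla:8` request in batch.sh; in distributed.py
    # the value would come from strategy.num_replicas_in_sync.
    for replicas in (1, 2, 8):
        print(replicas, scaled_batch_size(32, replicas))
```

In distributed.py this would mean passing strategy.num_replicas_in_sync into get_dataset() and batching with the scaled value. Whether a larger global batch also calls for a different learning rate depends on the model.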

batch.sh

#!/bin/bash
#SBATCH --job-name=tf-distr-cnn
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --partition gpu
#SBATCH --exclusive
#SBATCH --gres gpu:tesla:8
#SBATCH -w falcon5
#SBATCH -o container.out


export NCCL_DEBUG=INFO

# Option 1: run with the pip installation
module load python/3.9.12
module load cudnn

srun python distributed.py


### OR, if running inside a container
#module load singularity

#srun --mpi=pmi2 singularity exec --nv tensorflow_23.08-tf2-py3.sif python distributed.py
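
For the multi-node case mentioned under Code examples, MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG environment variable, which must be set on every worker before the strategy is created. The following is a minimal sketch of building that variable. The helper name build_tf_config, the port 12345, and the second host name falcon6 are assumptions for illustration, as is running one Slurm task per node so that SLURM_PROCID can serve as the worker index.

```python
import json
import os


def build_tf_config(worker_hosts, task_index, port=12345):
    """Return the TF_CONFIG JSON string for one worker of a multi-worker job."""
    if not worker_hosts:
        raise ValueError("worker_hosts must not be empty")
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index is out of range for worker_hosts")
    config = {
        # Every worker gets the same cluster description ...
        "cluster": {"worker": [f"{host}:{port}" for host in worker_hosts]},
        # ... and its own index into the worker list.
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Hypothetical two-node job; "falcon6" is an assumed host name.
    hosts = ["falcon5", "falcon6"]
    index = int(os.environ.get("SLURM_PROCID", "0"))
    os.environ["TF_CONFIG"] = build_tf_config(hosts, index)
    print(os.environ["TF_CONFIG"])
    # With TF_CONFIG set on every worker, distributed.py would create
    # tf.distribute.MultiWorkerMirroredStrategy() instead of MirroredStrategy.
```

TensorFlow also ships tf.distribute.cluster_resolver.SlurmClusterResolver, which can derive this layout from the Slurm environment; whether it fits a given job depends on how tasks and GPUs are allocated.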