Objective
The objective of this tutorial is to familiarize the user with some of the nuances of the WAVE HPC using a basic machine learning example, including:
- Working with external data sets
- Working with pre installed modules
- Using Slurm to access compute nodes
This tutorial is an adaptation of the NumPy Tutorial from Tensorflow.org.
To run this tutorial, I assume you already have access to the WAVE HPC with a user account and can open a terminal session on one of the login nodes in the WAVE cluster. See WAVE HPC User Guide - Accessing the HPC if you need help accessing the HPC.
Download the dataset into a local filesystem
Let's start by downloading the MNIST dataset to a local directory within the datasets filesystem.
mkdir /WAVE/datasets/<your dataset directory>/mnist
wget -O /WAVE/datasets/<your dataset directory>/mnist/mnist.npz https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
ls -l /WAVE/datasets/<your dataset directory>/mnist/
Note: In this tutorial, you will need to substitute your own project and dataset subdirectories. If you're just getting started, a possible workaround is to use the /WAVE/workarea/users filesystem to store both your dataset and project. More info can be found in the section Managing Files.
At this point you should have a local copy of the external data.
[<username>@login2 ~]$ ls -l /WAVE/datasets/<your dataset directory>/mnist/
total 11224
-rw-rw-r--. 1 <username> <group> 11490434 May 30 2018 mnist.npz
[<username>@login2 ~]$
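Before moving on, it can be worth peeking inside the archive. mnist.npz is an ordinary NumPy .npz archive, and the copy hosted by TensorFlow contains four arrays keyed x_train, y_train, x_test, and y_test. The sketch below builds a small stand-in archive with the same keys and shapes so it is self-contained (the dataset path above is a placeholder); point np.load at your downloaded file to inspect the real thing.

```python
import os
import tempfile

import numpy as np

# mnist.npz is a plain NumPy .npz archive. The TensorFlow-hosted copy
# holds four arrays: x_train, y_train, x_test, y_test.
# We create a small stand-in archive here so the snippet is self-contained;
# substitute the path to your downloaded mnist.npz to inspect the real file.
path = os.path.join(tempfile.mkdtemp(), 'mnist.npz')
np.savez_compressed(
    path,
    x_train=np.zeros((60000, 28, 28), dtype=np.uint8),
    y_train=np.zeros(60000, dtype=np.uint8),
    x_test=np.zeros((10000, 28, 28), dtype=np.uint8),
    y_test=np.zeros(10000, dtype=np.uint8))

# List each array's name, shape, and element type.
with np.load(path) as data:
    for key in sorted(data.files):
        print(key, data[key].shape, data[key].dtype)
```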
Establish a projects folder
While we're at it, let's establish a subdirectory within our projects folder to hold our working files and switch to that folder.
mkdir -p /WAVE/projects/<your project directory>/mnist-tutorial
cd /WAVE/projects/<your project directory>/mnist-tutorial
At this point we should be working out of the “mnist-tutorial” projects folder.
[<username>@login2 mnist-tutorial]$ pwd
/WAVE/projects/<your project directory>/mnist-tutorial
[<username>@login2 mnist-tutorial]$
Working with Preinstalled Modules
The WAVE HPC has pre-installed software covering many parallel and scientific computing needs which are available via modules. Use the following command to see which modules are available:
module avail
In this tutorial we will be using TensorFlow, an open-source platform for machine learning.
Load TensorFlow module
There are several versions of TensorFlow available on the WAVE HPC; we will use the latest, default version. Use the following command to load the TensorFlow software.
module load TensorFlow
At this point we should have TensorFlow, plus some dependent modules, loaded and ready for our use. Let's check that, first by listing what modules are loaded. We could do that with the module list command, which will list all the software packages that came with the TensorFlow module. That's interesting, but what matters more is whether the specific packages required by our program are present and compatible. Let's write a quick Python script that checks the installation for those specific packages.
In the Python code below, we are interested in importing tensorflow and numpy. Let's use the following code to check the software installation.
# check versions
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
import numpy
print('numpy: %s' % numpy.__version__)
Using an editor we'll add the code to a file versions.py. Note that I am executing out of the /WAVE/projects/<your project directory>/mnist-tutorial directory.
The following command will execute our file.
python versions.py
Below is the result of that command.
[<username>@login2 mnist-tutorial]$ python versions.py
2021-10-13 13:08:24.307690: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
tensorflow: 2.3.2
numpy: 1.17.3
[<username>@login2 mnist-tutorial]$
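If your own program needs more than these two packages, the same check generalizes. The sketch below (the package list is illustrative; substitute whatever your program imports) reports each requirement's version, or flags it as missing, without stopping at the first failure:

```python
import importlib

# Illustrative requirements list; substitute the packages your program imports.
REQUIRED = ('tensorflow', 'numpy')

for name in REQUIRED:
    try:
        # Import the package and report its version if it exposes one.
        mod = importlib.import_module(name)
        print('%s: %s' % (name, getattr(mod, '__version__', 'unknown')))
    except ImportError as err:
        # Keep going so one missing package doesn't hide the others.
        print('%s: MISSING (%s)' % (name, err))
```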
At this point we have a local copy of the data and we are able to load the required modules. Now we turn our attention to our sample Python program.
Sample Python Program
Using an editor, we'll add the following code to a file mnist-tutorial.py. This is our sample Python program. It differs slightly from the tutorial presented at https://www.tensorflow.org/tutorials/load_data/numpy, the main difference being how the data is loaded. The tensorflow.org tutorial relies on a Keras utility that downloads the data from an external URL. On the HPC, however, we will be running this program from a compute node, which does not have external internet access. Instead, we downloaded the data to a local copy stored in the datasets filesystem, and we'll access it from there.
Note: It is not the intention of this tutorial to teach the user how to use TensorFlow for machine learning; instead, we are focused specifically on the nuances of running something like TensorFlow within the HPC. If you are interested in understanding more about how this code does machine learning, I refer you back to TensorFlow.org.
# Set up
import numpy as np
import tensorflow as tf
import os
# Load Data from .npz file
data_dir = '/WAVE/datasets/<your dataset directory>/mnist/'
from tensorflow.keras.datasets import mnist
(train_examples, train_labels), (test_examples, test_labels) = mnist.load_data(path=data_dir+'mnist.npz')
# Load NumPy arrays with tf.data.Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
# Use the datasets
# Shuffle and batch the datasets
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
# Build and train a model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['sparse_categorical_accuracy'])
model.fit(train_dataset, epochs=10)
model.evaluate(test_dataset)
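As an aside, the Keras loader used above is a convenience; since mnist.npz is an ordinary NumPy archive, you could equally read the arrays directly with np.load. A minimal sketch (the helper name is ours, and the array keys are those used in the TensorFlow-hosted mnist.npz):

```python
import numpy as np

def load_mnist(path):
    # Read the four MNIST arrays straight from a local .npz archive,
    # returning them in the same (train, test) tuple layout as Keras.
    with np.load(path) as data:
        return ((data['x_train'], data['y_train']),
                (data['x_test'], data['y_test']))

# Example (substitute your dataset directory):
# (train_examples, train_labels), (test_examples, test_labels) = \
#     load_mnist('/WAVE/datasets/<your dataset directory>/mnist/mnist.npz')
```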
We are now ready to run our tutorial program. Because this is such a simple program, we could run it from a login node, but we don't want to do that. The login nodes in the WAVE HPC are for setting up and configuring our environment; they are not appropriate for compute-intensive tasks. Instead, the login nodes are our gateway to the compute resources in the WAVE cluster. We will use a resource-scheduling program called Slurm to gain access to those compute resources.
Using Slurm to access compute nodes
Slurm provides us the ability to execute a job on the backend compute nodes either interactively or in batch mode. We'll look at both approaches here, but in general the batch approach is more appropriate for longer, compute-intensive tasks.
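To give a feel for the batch approach before we dig in, a minimal Slurm batch script for our program might look like the sketch below. The resource requests and time limit are illustrative assumptions, and whether you must name a partition (and what it is called) depends on the WAVE cluster's configuration; check the cluster documentation for appropriate values. Such a script would be submitted with sbatch mnist-tutorial.sh.

```shell
#!/bin/bash
#SBATCH --job-name=mnist-tutorial       # name shown in the queue
#SBATCH --output=mnist-tutorial.%j.out  # stdout/stderr file (%j = job ID)
#SBATCH --nodes=1                       # a single node is plenty here
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4               # illustrative resource request
#SBATCH --time=00:30:00                 # illustrative time limit

# Load the same software on the compute node that we used on the login node.
module load TensorFlow

cd /WAVE/projects/<your project directory>/mnist-tutorial
python mnist-tutorial.py
```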