
From June 27-29, 2022, I attended the 2022 Cyberinfrastructure-Enabled Machine Learning (CIML) Summer Institute, hosted via Zoom by the San Diego Supercomputer Center (SDSC) at the University of California San Diego. The goal of CIML was to teach best practices for scalable machine learning on high-performance computing (HPC) resources. In the future, I may write blogs on other interesting topics from CIML as I apply them to my research. This blog is about how to run Jupyter notebooks as batch jobs on a supercomputer. The prerequisites for the concepts covered in this blog include:
- BASH scripting
- Conda environments
- Jupyter notebooks
- Python
- High-performance computing

Machine learning (ML) and artificial intelligence (AI) are becoming as ubiquitous in the computational sciences as HPC. One common tool for running ML/AI tasks, whether on HPC resources or locally, is the interactive notebook, such as a Jupyter notebook. On HPC resources, Jupyter notebooks are often run in an interactive session hosted by Open OnDemand (http://openondemand.org/) or via an interactive SLURM job. While it is usually sufficient to run one Jupyter notebook at a time, sometimes we would like to change a variable or two and test the results. One way to do this in an automated fashion is with Papermill (https://github.com/nteract/papermill), a tool for parameterizing and executing Jupyter notebooks. Since parameterizing a notebook is already covered in the documentation linked from the GitHub repository above, I will not discuss it here.
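As a quick illustration of what a single Papermill run looks like from the command line (the notebook and parameter names below are placeholders, not files from my project):
papermill input.ipynb output.ipynb -p n_atoms 5
Papermill injects the supplied parameter value into a copy of the notebook, executes it, and saves the executed copy as output.ipynb; the original input.ipynb is left untouched.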
In this example, I use a SLURM script called batch_cpu.sh to run a set of Jupyter notebooks on data from the GDB-11 database [1,2], covering molecules with 5 to 11 atoms. The script is set up like a standard SLURM script, with the addition of creating a conda environment in an automated fashion.
#!/bin/bash
# Batch script to run Jupyter Notebooks on a CPU node.
#SBATCH --account=ACCOUNTNAME # The project account to be charged
#SBATCH --job-name=batch_jupyter
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --partition=CONDONAME # If not specified then default is "campus"
#SBATCH --qos=QUALITYOFSERVICENAME
#SBATCH --time=0-90:00:00 # Wall time (days-hh:mm:ss)
#SBATCH --error=job.e%J # The file where run time errors will be dumped
#SBATCH --output=job.o%J # The file where the output of the terminal will be dumped
# Purge modules (I always keep intel-compilers loaded out of habit)
module purge
module load intel-compilers
# To make the environment.yml file:
# conda env export | grep -v "^prefix: " > environment.yml
# specify name of Conda environment, path to environment.yml file,
# notebook directory and a results directory
CONDA_ENV="ENVNAME"
REPO_DIR="./"
CONDA_YML="${REPO_DIR}/environment.yml"
export LOCAL_SCRATCH_DIR="/path/to/scratch/${USER}/cpu"
# Download the Miniconda3 installer if it is not already present
if [ ! -f "Miniconda3-latest-Linux-x86_64.sh" ]; then
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
fi
# install miniconda3 on node local disk
export CONDA_INSTALL_PATH="${LOCAL_SCRATCH_DIR}/miniconda3"
export CONDA_ENVS_PATH="${CONDA_INSTALL_PATH}/envs"
export CONDA_PKGS_DIRS="${CONDA_INSTALL_PATH}/pkgs"
./Miniconda3-latest-Linux-x86_64.sh -b -p "${CONDA_INSTALL_PATH}"
source "${CONDA_INSTALL_PATH}/etc/profile.d/conda.sh"
# use mamba to create the conda environment from environment.yml
${CONDA_INSTALL_PATH}/bin/conda install mamba -n base -c conda-forge --yes
${CONDA_INSTALL_PATH}/bin/mamba env create -f ${CONDA_YML}
# activate the environment (conda.sh was sourced above) and install papermill into it
conda activate "${CONDA_ENV}"
${CONDA_INSTALL_PATH}/bin/conda install -n ${CONDA_ENV} papermill -c conda-forge --yes
conda info
# Run the notebooks with papermill: loop over the GDB subsets with 5-9 atoms (zero-padded labels)
for i in {5..9}; do
    echo "Running GDB0${i}"
    export resultdir="GDB0${i}"
    if [ ! -d "./${resultdir}" ]; then
        echo "${resultdir} does not exist; creating it."
        mkdir "${resultdir}"
    fi
    export GDB1="0${i}"
    export GDB2="${i}"
    ${CONDA_ENVS_PATH}/${CONDA_ENV}/bin/papermill project.ipynb "${resultdir}/project.ipynb" -r GDB1 "${GDB1}" -r GDB2 "${GDB2}"
done
# Loop over the GDB subsets with 10 and 11 atoms
for i in 10 11; do
    echo "Running GDB${i}"
    export resultdir="GDB${i}"
    if [ ! -d "./${resultdir}" ]; then
        echo "${resultdir} does not exist; creating it."
        mkdir "${resultdir}"
    fi
    export GDB1="${i}"
    export GDB2="${i}"
    ${CONDA_ENVS_PATH}/${CONDA_ENV}/bin/papermill project.ipynb "${resultdir}/project.ipynb" -r GDB1 "${GDB1}" -r GDB2 "${GDB2}"
done
Note that the -r flag, which tells papermill to pass the value as a raw string, is used instead of -p. This is because ${GDB1} and ${GDB2} should remain strings; the parameters flag -p would convert them to integers and drop the leading zero from values like 05. A short example contrasting the two flags follows the submission command. To submit this script to the queue, type the following on the command line:
sbatch batch_cpu.sh
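To see the difference between the two flags, compare them on a zero-padded value (input.ipynb is a placeholder notebook name):
papermill input.ipynb out_p.ipynb -p GDB1 05   # -p infers the type, so GDB1 arrives in the notebook as the integer 5
papermill input.ipynb out_r.ipynb -r GDB1 05   # -r passes the raw string, so GDB1 stays "05"
For the zero-padded labels used above, the raw-string behavior is the one we want.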
It should also be noted that if you would like to reuse a conda environment you have created in the past, you can remove all the lines that download Miniconda and create a new environment. I also call ${CONDA_ENVS_PATH}/${CONDA_ENV}/bin/papermill by its full path to make sure the right version of papermill is used. In this example I also use mamba, a fast drop-in replacement for conda (reimplemented in C++) that can be used in place of conda for installing Python packages.
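For reference, here is a minimal sketch of what the environment setup reduces to when reusing an existing installation and environment (the paths and names are placeholders for your own setup):
# Reuse an existing Miniconda installation and conda environment
source "/path/to/miniconda3/etc/profile.d/conda.sh"
conda activate ENVNAME
# call papermill by its full path so the environment's copy is used
/path/to/miniconda3/envs/ENVNAME/bin/papermill project.ipynb GDB05/project.ipynb -r GDB1 05 -r GDB2 5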
Python Script f-string alternative
Another option I have used in the past is to write a Python script in which the variable you would like to change is inserted into a template string as an f-string. This works well as long as the input or code you are generating is not too long; if your code runs to hundreds of lines, this approach becomes cumbersome and I would recommend the Papermill method above.
The Python script below iteratively generates an input file and a SLURM script for each case, then submits each job:
import os

# "jobs" is assumed to be a dictionary defined earlier in the script,
# mapping a directory name to the case-specific content for that directory.
for i in jobs.keys():
    # write the input file for this case
    with open(f'{i}/PROJECTNAME.inp', 'w') as g:
        inp = ""  # default if there is no case-specific content
        if len(jobs[i]) != 0:
            inp = f"""
File contents that require an f-string
{i}
"""
        inp2 = """
file contents that do not change between cases
"""
        g.write(inp + inp2)
    # write the SLURM script for this case
    with open(f'{i}/script.sh', 'w') as k:
        slurm = f"""#!/bin/bash
# Batch script generated for each case.
#SBATCH --account=ACCOUNTNAME # The project account to be charged
#SBATCH --job-name=python_script
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH --partition=CONDONAME # If not specified then default is "campus"
#SBATCH --qos=QUALITYOFSERVICENAME
#SBATCH --time=0-90:00:00 # Wall time (days-hh:mm:ss)
#SBATCH --error=job.e%J # The file where run time errors will be dumped
#SBATCH --output=job.o%J # The file where the output of the terminal will be dumped
# DEFINE NAME OF THE PROJECT
export Project=PROJECTNAME
# THE COMMAND
executable $Project.inp > $Project.out
"""
        k.write(slurm)
    # submit the job from inside the case directory
    # (each os.system call runs in its own shell, so there is no need to cd back)
    os.system(f'cd {i} && sbatch script.sh')
References:
- [1] Fink, T.; Reymond, J.-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. J. Chem. Inf. Model. 2007, 47, 342-353.
- [2] Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.
- Download the GDB-11 database: https://gdb.unibe.ch/downloads/
- Papermill documentation: https://papermill.readthedocs.io/en/latest/