Run a case

Tested Clusters

SmartFlow has been tested and verified on various HPC clusters in Europe and China. The following table summarizes the hardware environment and job scheduling systems used during deployment.

HPC Clusters Tested with SmartFlow

Country

Cluster

Partition

Scheduler

Network Interface

Italy

CINECA

Booster

Slurm

ib0

China

BSCC (Beijing Super Cloud Center)

N32EA14P

Slurm

ib0

China

BSCC (Beijing Super Cloud Center)

BSCC-A

Slurm

ib0

Running on a standalone machine

To run a case on a standalone machine, we can use:

python /your/path/SmartFlow/src/smartflow/main.py

It should be noted that the main.py file is located in the /your/path/SmartFlow/src/smartflow/main.py .

You may encounter an error message in your current running folder ../../SmartFlow/examples/train_retau_05200/err file. Please check the error message and fix it with your own settings. Perhaps, you may see some errors about the wandb as we import wandb library. If you see wandb error, please create or login wandb with your own account and add API_key as follows:

import wandb
wandb.login(key="your_api_key")

Running with SLURM on a CPU cluster

To run a case on a CPU cluster, we can use a SLURM script such as:

#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --nodes=2
#SBATCH --job-name=smartflow
#SBATCH --account=user_account
#SBATCH --qos=qos_name
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

python main.py

The script is submitted with the following command:

sbatch slurm.sh

For this setting, we allocate 2 nodes with 32 tasks per node and 1 CPU per task (for a total of 64 tasks). The job is submitted to the qos_name queue under the user_account account. The job is expected to run for 48 hours. The output and error logs are saved in the slurm-%j.out and slurm-%j.err files, respectively, where %j represents the job ID.

Running with SLURM on a GPU-accelerated cluster

To run a case on a GPU-accelerated cluster, we can use a SLURM script such as:

#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --nodes=2
#SBATCH --job-name=smartflow
#SBATCH --account=user_account
#SBATCH --qos=qos_name
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

python main.py

For this setting, we allocate 2 nodes with 32 tasks per node and 8 CPUs per task, along with 4 GPUs per node for acceleration. The job is submitted to the qos_name queue under the user_account account. The job is expected to run for 48 hours. The output and error logs are saved in the slurm-%j.out and slurm-%j.err files, respectively, where %j represents the job ID.