Running a job

Executing a particle reconstruction model with DRGN-AI

Once the experiment directory your_workdir and the corresponding configuration file your_workdir/configs.yaml have been created, we can run the experiment using drgnai train your_workdir. This will create a subfolder your_workdir/out that will contain the output of the experiment.
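
As a rough sketch, based on the files referenced later on this page, the experiment directory ends up organized roughly as follows (exact file names and checkpoint intervals depend on your configuration):

your_workdir/
├── configs.yaml          # configuration created by drgnai setup
└── out/
    ├── training.log      # running log of the training step
    ├── weights.95.pkl    # saved model checkpoints, one per saved epoch
    └── analysis_100/     # created by the analysis step for epoch 100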

drgnai train also runs a series of analyses after training completes, using the final training epoch of the reconstruction. To analyze an earlier epoch instead, use drgnai analyze your_workdir --epoch 15, or skip the analyses during training and run them yourself afterwards:

drgnai train your_workdir --no-analysis
drgnai analyze your_workdir --epoch 20

Training a reconstruction neural network is usually computationally intensive; we thus recommend using a high-performance compute cluster to run DRGN-AI experiments. For example, a submission script to a cluster using the Slurm job scheduler would look like:

#!/bin/bash
#SBATCH --partition=cryoem
#SBATCH --job-name=drgnai
#SBATCH -t 3:00:00
#SBATCH --gres="gpu:a100:1"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

cd /scratch_dir/my_name

drgnai setup your_workdir --particles /data_dir/empiar_benchmark/particles.128.mrcs \
                            --ctf /data_dir/empiar_benchmark/ctf.pkl
drgnai train your_workdir

The script can then be submitted with sbatch; note that options passed on the sbatch command line (such as -t and -p below) override the corresponding #SBATCH directives in the script:

(drgnai-env) $ sbatch -t 8:00:00 -p cryoem -J drgnai_test -o drgnai_test.out drgnai_slurm.sh 

Note that this example uses the default configuration parameters for how training and analysis are carried out while manually specifying the input dataset.
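
For reference, the configuration file written by drgnai setup in this example records the dataset paths passed on the command line; a minimal sketch, assuming the keys are named after the corresponding setup options (the actual keys and defaults depend on your DRGN-AI version), might look like:

# your_workdir/configs.yaml (sketch)
particles: /data_dir/empiar_benchmark/particles.128.mrcs
ctf: /data_dir/empiar_benchmark/ctf.pkl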

Reusing an output directory

The same experiment directory can be reused for multiple train runs. If a non-empty out/ subfolder already exists when you run train, DRGN-AI renames the existing experiment output using an automatically generated label beginning with old_out_, which you can then rename as needed.
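
For instance, after a second train run in the same directory you would see something like the following (the exact old_out_ label is generated automatically):

your_workdir/
├── configs.yaml
├── old_out_<auto-generated label>/   # output of the previous run, renamed automatically
└── out/                              # output of the most recent run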

Restarting finished experiments

Experiments that have finished running can be trained for further epochs using the --load flag, which tells train to find the last completed epoch in the given output directory and resume from there. This reuses the already saved configuration parameters, so no further arguments need to be specified:

drgnai train your_workdir/ --load

You can also specify a particular epoch to load by passing its saved checkpoint as the load parameter in configs.yaml:

load: /full-path-to-your-work-dir/out/weights.95.pkl
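
Putting the two together, one way to resume from that specific checkpoint is to set the load parameter as above and then rerun train, assuming train picks the load parameter up from configs.yaml like the other saved parameters (the checkpoint path here is just the example from above):

# 1. In your_workdir/configs.yaml, point load at the desired checkpoint:
#        load: /full-path-to-your-work-dir/out/weights.95.pkl
# 2. Rerun training from the same experiment directory:
drgnai train your_workdir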

Monitoring running experiments

A running log of the training step is written to out/training.log and remains available after training finishes.
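
To follow the log from the command line while a job is running, standard tail works fine:

tail -f your_workdir/out/training.log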

The training step can also be monitored while it is running using TensorBoard, which is installed as part of DRGN-AI, by following these steps:

  1. Run the command tensorboard --logdir your_workdir/out --port 6565 --bind_all on the remote server, where your_workdir is the experiment directory and 6565 is an arbitrary port number.

  2. Run the command ssh -NfL 6565:<server-name>:6565 <user-name>@<server-address> on your local machine, using the same port number as above and replacing the server information with your own.

  3. Navigate to localhost:6565 in your local browser to access the TensorBoard interface.

For example, after running this command on the remote server, <server-name> would be della-gpu:

tensorboard --logdir your_workdir/out --port 6565 --bind_all
TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.16.2 at http://della-gpu:6565/ (Press CTRL+C to quit)

👉 Using a terminal multiplexer like tmux will make your life easier (see the minimal example after these tips)!

👉 You can monitor your experiments (and see their job IDs) with watch -n 3 squeue -u YOUR_USERNAME
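
For example, a minimal tmux workflow for keeping TensorBoard (or any long-running command) alive after you disconnect, using only standard tmux commands:

tmux new -s tb                       # start a named session
tensorboard --logdir your_workdir/out --port 6565 --bind_all
# detach with Ctrl-b then d; the session keeps running in the background
tmux attach -t tb                    # reattach to it later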

Examining experiment outputs

Once the analysis performed by train or analyze has finished, its outputs can be found in your_workdir/out/analysis_100, where 100 corresponds to the (0-indexed) epoch used in the analysis step.
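
For example, to list the available analysis folders and inspect the one for epoch 100 (plain shell, nothing DRGN-AI-specific):

ls -d your_workdir/out/analysis_*    # one folder per analyzed epoch
ls your_workdir/out/analysis_100     # outputs for the epoch-100 analysis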
