Running a job
Executing a particle reconstruction model with DRGN-AI
Once the experiment directory your_workdir and the corresponding configuration file your_workdir/configs.yaml have been created, we can run the experiment using drgnai train your_workdir. This will create a subfolder your_workdir/out that will contain the output of the experiment.
drgnai train also runs a series of analyses after training is completed, using the final training epoch of the reconstruction. To analyze an earlier training epoch, we can use drgnai analyze your_workdir --epoch 15, or skip the automatic analyses during training and run them ourselves instead:
drgnai train your_workdir --no-analysis
drgnai analyze your_workdir --epoch 20
Training a reconstruction neural network is usually computationally intensive; we thus recommend using a high-performance compute cluster to run cryoDRGN-AI experiments. For example, a submission script for a cluster using the Slurm job scheduler would look like:
#!/bin/bash
#SBATCH --partition=cryoem
#SBATCH --job-name=drgnai
#SBATCH -t 3:00:00
#SBATCH --gres="gpu:a100:1"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
cd /scratch_dir/my_name
drgnai setup your_workdir --particles /data_dir/empiar_benchmark/particles.128.mrcs \
--ctf /data_dir/empiar_benchmark/ctf.pkl
drgnai train your_workdir
This script can then be submitted using:
(drgnai-env) $ sbatch -t 8:00:00 -p cryoem -J drgnai_test -o drgnai_test.out drgnai_slurm.sh
Note that here we have used the default configuration parameters for training and analysis while manually specifying an input dataset.
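Before submitting, it can be useful to double-check the configuration file that setup generated. A minimal sketch of what it might contain is shown below; the exact keys and defaults depend on your DRGN-AI version, so the particles and ctf entries here are illustrative assumptions based on the setup arguments above.
(drgnai-env) $ cat your_workdir/configs.yaml
particles: /data_dir/empiar_benchmark/particles.128.mrcs   # illustrative; actual keys may differ
ctf: /data_dir/empiar_benchmark/ctf.pkl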
Using multiple GPUs
The default behavior of cryoDRGN-AI is to use a single GPU, even if many GPUs are available on the same node. This can be changed by passing the --multigpu option to drgnai train, or by setting multigpu: True in configs.yaml.
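For example, to make use of the additional GPUs, either pass the flag on the command line or add the setting to the configuration file:
drgnai train your_workdir --multigpu
or, in your_workdir/configs.yaml:
multigpu: True
When submitting through Slurm, remember to also request more than one GPU in the submission script (e.g. --gres="gpu:a100:2") so that multiple devices are actually visible to the job.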
Inverting datasets
Some datasets, such as the 50S ribosome (EMPIAR-10076), do not need to be inverted from light-on-dark to dark-on-light images before training, due to differing conventions used in upstream processing. Inversion can be turned off by adding the inverse_data: False setting to configs.yaml.
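For example, to keep such a dataset as-is, add the setting to the experiment's configuration file before training:
# your_workdir/configs.yaml
inverse_data: False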
Reusing an output directory
The same output folder can be reused for multiple train runs. If a non-empty out/ subfolder already exists in your output folder when you run train, the existing experiment output is renamed with an automatically generated label beginning with old_out_, which you can then rename as necessary.
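As a sketch of what this looks like after a second train run over the same experiment directory (the exact auto-generated label will differ):
(drgnai-env) $ ls your_workdir
configs.yaml  old_out_<label>  out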
Restarting finished experiments
Experiments that have finished running can be trained for further epochs using the --load flag, which tells train to look for the last completed epoch in the given output directory and resume from there. This reuses the already saved configuration parameters, so no further arguments need to be specified:
drgnai train your_workdir/ --load
You can also specify a particular epoch to load by passing its saved checkpoint as the load parameter in configs.yaml:
load: /full-path-to-your-work-dir/out/weights.95.pkl
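Training can then be resumed with the usual command; this is a minimal sketch assuming the load entry in configs.yaml is picked up by train without any additional flags:
drgnai train your_workdir/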
Monitoring running experiments
A running log of training is saved to out/training.log, which can be inspected both during and after the training step.
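For example, to follow the log of a running experiment from the command line:
tail -f your_workdir/out/training.log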
The training step can be monitored while it is running using TensorBoard, which is installed as part of cryoDRGN-AI, by following these steps:
1. Run the command tensorboard --logdir your_workdir/out --port 6565 --bind_all remotely, where your_workdir is the experiment directory and 6565 is an arbitrary port number.
2. Run the command ssh -NfL 6565:<server-name>:6565 <user-name>@<server-address> locally, using the same port number as above and replacing the server info with your own.
3. Navigate to localhost:6565 in your local browser to access the TensorBoard interface.
For example, in the following case, <server-name> would be della-gpu once you have run this command remotely:
tensorboard --logdir your_workdir/out --port 6565 --bind_all
TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.16.2 at http://della-gpu:6565/ (Press CTRL+C to quit)
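In that case, the matching local tunnel would look something like the following, keeping the user and server address as placeholders for your own login details:
ssh -NfL 6565:della-gpu:6565 <user-name>@<server-address>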
👉 Using a terminal multiplexer like tmux will make your life easier!
👉 You can monitor your experiments (and see their job IDs) with watch -n 3 squeue -u YOUR_USERNAME
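For example, a minimal tmux workflow for keeping the TensorBoard server alive after you disconnect (the session name is arbitrary):
tmux new -s drgnai-tb
tensorboard --logdir your_workdir/out --port 6565 --bind_all
# detach with Ctrl-b d; later, reattach with:
tmux attach -t drgnai-tb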
Examining experiment outputs
Once the analysis done by train or analyze is finished, the outputs can be found in your_workdir/out/analysis_100, where 100 corresponds to the epoch (0-indexed) used in the analysis step.
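For example, to list the outputs produced for epoch 100 (the exact contents will depend on your configuration):
ls your_workdir/out/analysis_100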