Running a job
Executing a particle reconstruction model with DRGN-AI
Once the experiment directory your_workdir and the corresponding configuration file your_workdir/configs.yaml have been created, we can run the experiment using drgnai train your_workdir. This will create a subfolder your_workdir/out that will contain the output of the experiment.
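For example (a sketch; the listing below only shows the items mentioned on this page, and the actual out/ folder will contain additional files):

```bash
drgnai train your_workdir

ls your_workdir
# configs.yaml  out/
```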
drgnai train also runs a series of analyses after training is completed, using the final training epoch of the reconstruction. To analyze an earlier training epoch we can use drgnai analyze your_workdir --epoch 15, or avoid having train do the analyses and direct the analyses ourselves instead:
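For example, to analyze the model saved at an arbitrary earlier epoch such as epoch 15:

```bash
drgnai analyze your_workdir --epoch 15
```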
Training a reconstruction neural network is usually computationally intensive; we thus recommend using a high-performance compute cluster to run DRGN-AI experiments. For example, a submission script to a cluster using the Slurm job scheduler would look like:
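A sketch of such a script; the resource requests, module name, and conda environment name are assumptions that should be adapted to your own cluster:

```bash
#!/bin/bash
#SBATCH --job-name=drgnai-train    # name shown in squeue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # CPU workers for data loading
#SBATCH --gres=gpu:1               # one GPU for training
#SBATCH --mem=64G                  # adjust to your dataset size
#SBATCH --time=24:00:00            # wall-time limit

# activate the environment where DRGN-AI is installed
# (module and environment names are placeholders)
module load anaconda3
conda activate drgnai

drgnai train your_workdir
```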
The script can then be submitted with:
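Assuming the script above was saved as train_drgnai.sh (a hypothetical filename):

```bash
sbatch train_drgnai.sh
```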
Note that this example uses the default configuration parameters for how training and analysis are carried out while manually specifying an input dataset.
Reusing an output directory
The same output folder can be reused for multiple train runs. If a non-empty out/ subfolder already exists in your output folder when you run train, the existing experiment output will be renamed using an automatically-generated label beginning with old_out_, which you can then rename as necessary.
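Schematically (a sketch; the exact auto-generated label will differ):

```
your_workdir/
├── out/               # output of the new train run
└── old_out_<label>/   # previous run, renamed automatically
```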
Restarting finished experiments
Experiments that have finished running can be run for further epochs using the --load flag, which tells train to look for the last completed epoch in the given output directory and resume from there. The already saved configuration parameters are reused, so no further arguments need to be specified:
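For example (a sketch based on the flag described above):

```bash
drgnai train your_workdir --load
```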
You can also specify a particular epoch to load by passing its saved checkpoint as the load parameter in configs.yaml:
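A sketch of the relevant configs.yaml entry; the checkpoint filename pattern is an assumption, so use the actual filename found in your out/ folder:

```yaml
# resume training from the checkpoint saved at a specific epoch
load: out/weights.15.pkl
```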
Monitoring running experiments
A running log of the training step is saved at out/training.log and can be inspected both during and after training.
The training step can be monitored while it is running using Tensorboard, which is installed as part of DRGN-AI, by following these steps:
1. Run the command tensorboard --logdir your_workdir/out --port 6565 --bind_all remotely, where your_workdir/out is the experiment output directory and 6565 is an arbitrary port number.
2. Run the command ssh -NfL 6565:<server-name>:6565 <user-name>@<server-address> locally, using the same port number as above and replacing the server info with your own.
3. Navigate to localhost:6565 in your local browser to access the Tensorboard interface.
For example, if you find that you are on a node named della-gpu after running the tensorboard command remotely, then <server-name> would be della-gpu, as in the sketch below.
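A sketch of the corresponding local tunnel command (the user name and server address are placeholders to replace with your own):

```bash
ssh -NfL 6565:della-gpu:6565 <user-name>@<server-address>
```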
👉 Using a terminal multiplexer like tmux will make your life easier!
👉 You can monitor your experiments (and see their job IDs) with watch -n 3 squeue -u YOUR_USERNAME
Examining experiment outputs
Once the analysis done by train or analyze is finished, the outputs can be found under your_workdir/out/analysis_100, where 100 corresponds to the (0-indexed) epoch used in the analysis step.