# Running a job

Once the experiment directory `your_workdir` and the corresponding configuration file `your_workdir/configs.yaml` have been created, we can run the experiment using `drgnai train your_workdir`. This will create a subfolder `your_workdir/out` that will contain the output of the experiment.

`drgnai train` also runs a series of analyses after training completes, using the final training epoch of the reconstruction. To analyze an earlier epoch instead, use `drgnai analyze your_workdir --epoch 15`; alternatively, skip the automatic analyses during `train` and run them yourself:

```
drgnai train your_workdir --no-analysis
drgnai analyze your_workdir --epoch 20
```
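The epochs available to `--epoch` correspond to the checkpoint files saved in the `out/` subfolder, which follow the `weights.<epoch>.pkl` naming pattern shown later in this guide. A minimal sketch of how to list them, using a dummy directory so the snippet is self-contained (a real run will have its own set of checkpoints):

```shell
# Create a dummy experiment directory standing in for a real run
mkdir -p demo_workdir/out
touch demo_workdir/out/weights.{0,10,20}.pkl

# Extract the epoch number from each checkpoint filename
for f in demo_workdir/out/weights.*.pkl; do
    epoch=${f##*/weights.}   # drop the directory prefix and "weights."
    epoch=${epoch%.pkl}      # drop the ".pkl" suffix
    echo "epoch: $epoch"
done
```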

Training a reconstruction neural network is usually computationally intensive; we thus recommend using a high-performance compute cluster to run cryoDRGN-AI experiments. For example, a submission script to a cluster using the Slurm job scheduler would look like:

```
#!/bin/bash
#SBATCH --partition=cryoem
#SBATCH --job-name=drgnai
#SBATCH -t 3:00:00
#SBATCH --gres="gpu:a100:1"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

cd /scratch_dir/my_name

drgnai setup your_workdir --particles /data_dir/empiar_benchmark/particles.128.mrcs \
                            --ctf /data_dir/empiar_benchmark/ctf.pkl
drgnai train your_workdir
```

And submitted using:

```
(drgnai-env) $ sbatch -t 8:00:00 -p cryoem -J drgnai_test -o drgnai_test.out drgnai_slurm.sh 
```

Note that here we have used the default configuration parameters governing how training and analysis are carried out, while manually specifying the input dataset.

### Using multiple GPUs

The default behavior of cryoDRGN-AI is to use a single GPU, even if many GPUs are available on the same node. This can be changed by passing the `--multigpu` option to `drgnai train`, or by setting `multigpu: True` in `configs.yaml`.
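The corresponding `configs.yaml` entry is simply:

```yaml
multigpu: True
```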

### Inverting datasets

Due to differing conventions in upstream processing, some datasets, such as the 50S ribosome dataset (EMPIAR-10076), do **not** need to be inverted from light-on-dark to dark-on-light images before training. Inversion can be disabled by adding the `inverse_data: False` setting to `configs.yaml`.
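The corresponding `configs.yaml` entry:

```yaml
inverse_data: False
```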

### Reusing an output directory

The same experiment directory can be reused for multiple `train` runs. If a non-empty `out/` subfolder already exists when you run `train`, the existing experiment output is renamed using an automatically generated label beginning with `old_out_`, which you can then rename as necessary.
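As a hedged sketch of what this looks like afterwards (the `old_out_0001` label is made up for illustration; the actual generated label will differ), simulated here with dummy directories:

```shell
# Simulate an experiment folder after a second `train` run: the previous
# results were moved aside under an auto-generated old_out_* label
mkdir -p example_workdir/out example_workdir/old_out_0001

# Give the preserved results a more descriptive name
mv example_workdir/old_out_0001 example_workdir/first_run_out
ls example_workdir
```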

### Restarting finished experiments

Experiments that have finished running can be run for further epochs using the `--load` flag, which tells `train` to look for the last completed epoch in the given output directory and resume from there. Training resumes with the already saved configuration parameters, so no further arguments need to be specified:

```
drgnai train your_workdir/ --load
```

You can also specify a particular epoch to load by passing its saved checkpoint as the `load` parameter in `configs.yaml`:

```
load: /full-path-to-your-work-dir/out/weights.95.pkl
```

## Monitoring running experiments

A log of the training step is written to `out/training.log` and can be inspected both during and after training.

The training step can be **monitored** while it is running using [TensorBoard](https://www.tensorflow.org/tensorboard), which is installed as part of cryoDRGN-AI, by following these steps:

1. On the remote server, run `tensorboard --logdir your_workdir/out --port 6565 --bind_all`, where `your_workdir` is the experiment directory and `6565` is an arbitrary port number.
2. On your local machine, run `ssh -NfL 6565:<server-name>:6565 <user-name>@<server-address>`, using the same port number as above and replacing the server info with your own.
3. Navigate to `localhost:6565` in your local browser to access the TensorBoard interface.

For example, in the following case `<server-name>` would be `della-gpu`, as reported by TensorBoard once you have run the command on the remote server:

```
tensorboard --logdir your_workdir/out --port 6565 --bind_all
TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.16.2 at http://della-gpu:6565/ (Press CTRL+C to quit)
```

👉 Using a terminal multiplexer like [tmux](https://github.com/tmux/tmux/wiki) will make your life easier!

👉 You can monitor your experiments (and see their job IDs) with `watch -n 3 squeue -u YOUR_USERNAME`.

## Examining experiment outputs

Once the analyses run by `train` or `analyze` have finished, their outputs can be found in `your_workdir/out/analysis_100`, where `100` is the (0-indexed) training epoch used in the analysis step.
