# Running a job

Once the experiment directory `your_workdir` and the corresponding configuration file `your_workdir/configs.yaml` have been created, we can run the experiment using `drgnai train your_workdir`. This will create a subfolder `your_workdir/out` that will contain the output of the experiment.

After training completes, `drgnai train` also runs a series of analyses using the final training epoch. To analyze an earlier epoch instead, we can run `drgnai analyze your_workdir --epoch 15`. Alternatively, we can tell `train` to skip the analyses with `--no-analysis` and run them ourselves:

```
drgnai train your_workdir --no-analysis
drgnai analyze your_workdir --epoch 20
```
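When several epochs are of interest, the `analyze` call can be wrapped in a loop. The following is a minimal sketch: `analyze_epochs` is a hypothetical helper (not part of cryoDRGN-AI), and it assumes that checkpoints for the requested epochs were saved during training.

```shell
# Hypothetical helper: run `drgnai analyze` once per epoch.
# Usage: analyze_epochs WORKDIR EPOCH [EPOCH ...]
analyze_epochs() {
  local workdir="$1"
  shift
  local epoch
  for epoch in "$@"; do
    # Same command as above, with the epoch substituted in.
    drgnai analyze "$workdir" --epoch "$epoch"
  done
}
```

For instance, `analyze_epochs your_workdir 10 15 20` would run the analysis three times, once for each listed epoch.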

Training a reconstruction neural network is usually computationally intensive; we thus recommend using a high-performance compute cluster to run cryoDRGN-AI experiments. For example, a submission script to a cluster using the Slurm job scheduler would look like:

```
#!/bin/bash
#SBATCH --partition=cryoem
#SBATCH --job-name=drgnai
#SBATCH -t 3:00:00
#SBATCH --gres="gpu:a100:1"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

cd /scratch_dir/my_name

drgnai setup your_workdir --particles /data_dir/empiar_benchmark/particles.128.mrcs \
                            --ctf /data_dir/empiar_benchmark/ctf.pkl
drgnai train your_workdir
```

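The script can then be submitted as follows. Options given on the `sbatch` command line (here `-t`, `-p`, and `-J`) take precedence over the matching `#SBATCH` directives in the script: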

```
(drgnai-env) $ sbatch -t 8:00:00 -p cryoem -J drgnai_test -o drgnai_test.out drgnai_slurm.sh
```

Note that here we have used the default configuration parameters for how training and analysis will be carried out while manually specifying an input dataset.
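When submitting through Slurm, it can be convenient to capture the job ID for later use with `squeue` or `scancel`. The sketch below relies on the `--parsable` option of `sbatch`, which prints the job ID (optionally followed by `;` and a cluster name) instead of the usual message. `submit_job` is a hypothetical helper name, not something provided by cryoDRGN-AI or Slurm.

```shell
# Hypothetical helper: submit a job script and print only the job ID.
# All arguments are passed through to sbatch unchanged.
submit_job() {
  local raw_id
  raw_id="$(sbatch --parsable "$@")" || return 1
  # --parsable output is "jobid" or "jobid;cluster"; keep the part before ';'.
  printf '%s\n' "${raw_id%%;*}"
}
```

A possible use would be `job_id="$(submit_job -t 8:00:00 -p cryoem drgnai_slurm.sh)"` followed by `squeue -j "$job_id"`.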

#### Using multiple GPUs

The default behavior of cryoDRGN-AI is to use a single GPU, even if many GPUs are available on the same node. This can be changed using the `--multigpu` option to `drgnai train`, or setting `multigpu: True` in `configs.yaml`.
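As an illustration, a job script could add `--multigpu` only when more than one GPU is actually visible on the node. This is a sketch under stated assumptions: `count_gpus` and `train_on_available_gpus` are hypothetical helpers, and GPU detection relies on `nvidia-smi -L`, which prints one `GPU N: ...` line per device.

```shell
# Hypothetical helper: count the GPUs visible on this node.
# Prints 0 if nvidia-smi is missing or reports no devices.
count_gpus() {
  nvidia-smi -L 2>/dev/null | grep -c '^GPU '
}

# Hypothetical helper: train with --multigpu only if several GPUs are present.
# Usage: train_on_available_gpus WORKDIR [N_GPUS]
train_on_available_gpus() {
  local workdir="$1"
  local n_gpus="${2:-$(count_gpus)}"
  if [ "$n_gpus" -gt 1 ]; then
    drgnai train "$workdir" --multigpu
  else
    drgnai train "$workdir"
  fi
}
```

Calling `train_on_available_gpus your_workdir` inside the Slurm script would then pick the appropriate form of the command for the allocated node.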

#### Inverting datasets

Some datasets, such as 50S (EMPIAR-10076), do **not** need to be inverted from light-on-dark images to dark-on-light images before training, due to differing conventions used in upstream processing. For these datasets, inversion can be turned off by adding the `inverse_data: False` setting to `configs.yaml`.

### Reusing an output directory

The same output folder can be reused for multiple `train` runs. If a non-empty `out/` subfolder already exists in your output folder when you run `train`, the existing experiment output is renamed with an automatically generated label beginning with `old_out_`, which you can then rename as necessary.
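To see which earlier runs have been set aside, the renamed folders can be listed with a glob. The sketch below assumes that the `old_out_` folders sit directly inside the experiment directory, next to `out/`; `list_old_outputs` is a hypothetical helper name.

```shell
# Hypothetical helper: list renamed outputs of previous runs.
# Assumes old_out_* folders are created directly inside WORKDIR.
list_old_outputs() {
  local workdir="$1"
  local d
  for d in "$workdir"/old_out_*/; do
    # If nothing matches, the pattern is left unexpanded; skip it.
    [ -d "$d" ] || continue
    printf '%s\n' "${d%/}"
  done
}
```

For example, `list_old_outputs your_workdir` prints one path per line, or nothing if no earlier output has been renamed.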

### Restarting finished experiments

Experiments that have finished running can be run for further epochs using the `--load` flag, which tells `train` to look for the last finished epoch in the given output directory and start from there. This uses the already saved configuration parameters, so no further arguments need to be specified:

```
drgnai train your_workdir/ --load
```

You can also specify a particular epoch to load by passing its saved checkpoint as the `load` parameter in `configs.yaml`:

```
load: /full-path-to-your-work-dir/out/weights.95.pkl
```
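To find a checkpoint path to use as the `load` value, the saved weights can be scanned by epoch number. This sketch assumes checkpoints follow the `weights.<epoch>.pkl` naming seen above; `latest_checkpoint` is a hypothetical helper. It compares epochs numerically, since a plain alphabetical listing would place `weights.9.pkl` after `weights.95.pkl`.

```shell
# Hypothetical helper: print the checkpoint with the highest epoch number.
# Assumes files are named weights.<epoch>.pkl inside OUTDIR.
latest_checkpoint() {
  local outdir="$1"
  local best=-1 best_file="" f epoch
  for f in "$outdir"/weights.*.pkl; do
    [ -e "$f" ] || continue
    epoch="${f##*/weights.}"   # drop everything up to and including "weights."
    epoch="${epoch%.pkl}"      # drop the ".pkl" suffix
    # Ignore files whose epoch part is not a plain non-negative integer.
    case "$epoch" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$epoch" -gt "$best" ]; then
      best="$epoch"
      best_file="$f"
    fi
  done
  # Print the result; return non-zero if no checkpoint was found.
  [ -n "$best_file" ] && printf '%s\n' "$best_file"
}
```

Running `latest_checkpoint /full-path-to-your-work-dir/out` prints a path that can be pasted into `configs.yaml`.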

## Monitoring running experiments

A running log of the training step is saved at `out/training.log` both during and after the training step.

The training step can be **monitored** while it is running using [TensorBoard](https://www.tensorflow.org/tensorboard), which is installed as part of cryoDRGN-AI, by following these steps:

1. Run the command `tensorboard --logdir your_workdir/out --port 6565 --bind_all` remotely, where `your_workdir/out` is the experiment output directory and `6565` is an arbitrary port number.
2. Run the command `ssh -NfL 6565:<server-name>:6565 <user-name>@<server-address>` locally, using the same port number as above and replacing the server info with your own.
3. Navigate to `localhost:6565` in your local browser to access the TensorBoard interface.

For example, in the following case, `<server-name>` would be `della-gpu` once you have run this command remotely:

```
tensorboard --logdir your_workdir/out --port 6565 --bind_all
TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.16.2 at http://della-gpu:6565/ (Press CTRL+C to quit)
```
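The mapping from the TensorBoard startup message to the tunnel command in step 2 can be sketched as follows. Both functions are hypothetical helpers written for illustration, and the parsing assumes the startup line has the `TensorBoard <version> at http://<server-name>:<port>/` form shown above.

```shell
# Hypothetical helper: read TensorBoard's startup output on stdin and
# print the server name taken from the "at http://<server-name>:<port>/" line.
server_name_from_banner() {
  sed -n 's|^TensorBoard .* at http://\([^:/]*\):[0-9]*/.*|\1|p'
}

# Hypothetical helper: print the local SSH tunnel command from step 2.
# Usage: tunnel_command PORT SERVER_NAME USER@SERVER_ADDRESS
tunnel_command() {
  local port="$1" server_name="$2" login="$3"
  printf 'ssh -NfL %s:%s:%s %s\n' "$port" "$server_name" "$port" "$login"
}
```

With the output above, `server_name_from_banner` yields `della-gpu`, and `tunnel_command 6565 della-gpu <user-name>@<server-address>` prints the command to run locally.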

👉 Using a terminal multiplexer like [tmux](https://github.com/tmux/tmux/wiki) will make your life easier!

👉 You can monitor your experiments (and see their job IDs) with `watch -n 3 squeue -u YOUR_USERNAME`

## Examining experiment outputs

Once the analysis run by `train` or `analyze` has finished, the outputs can be accessed in `your_workdir/out/analysis_100`, where `100` corresponds to the epoch (0-indexed) used in the analysis step.
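When analyses have been run for several epochs, the available ones can be listed from the folder names. This is a minimal sketch that assumes the `analysis_<epoch>` naming described above; `analyzed_epochs` is a hypothetical helper.

```shell
# Hypothetical helper: print the epochs that have an analysis folder,
# in ascending numeric order. Assumes WORKDIR/out/analysis_<epoch>/ naming.
analyzed_epochs() {
  local workdir="$1"
  local d
  for d in "$workdir"/out/analysis_*/; do
    [ -d "$d" ] || continue          # skip the pattern if nothing matches
    d="${d%/}"                       # drop the trailing slash
    printf '%s\n' "${d##*analysis_}" # keep only the epoch suffix
  done | sort -n
}
```

For example, `analyzed_epochs your_workdir` prints one epoch number per line.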


