CryoDRGN-AI ab initio EMPIAR-10076 tutorial

Ab initio reconstruction of the 50S ribosome dataset using cryoDRGN-AI volume reconstruction

Here we provide a walkthrough of an ab initio analysis of the assembling ribosome dataset (EMPIAR-10076), based on the results presented in Figure 2b of the cryoDRGN-AI manuscript. We will use the version of cryoDRGN-AI implemented in the cryodrgn abinit command; the original experiment was done using the DRGN-AI package.

Figure 2b from the Nature Methods (2025) cryoDRGN-AI manuscript

We will follow an abbreviated version of the general recommended workflow for cryoDRGN training:

  1. First, train on lower-resolution images (e.g. D=128) using the default (fast) architecture as an initial pass to sanity-check results and remove junk particles.

  2. After any particle filtering, train a larger model and/or train for longer using the --dim 1024 and --epochs-pose-search=5 arguments; these can potentially capture more heterogeneity.

For a full step-by-step tutorial that includes all of the preprocessing steps required to prepare an input dataset for analysis with cryoDRGN, see the original tutorial, which is broadly similar to the material below but done using the cryodrgn train_vae command.

1) Initial CryoDRGN-AI training

We begin by running the default architecture, which is designed to run relatively quickly to provide an initial pass for checking results and filtering particles. We assume here that we have downsampled our particles to D=128, listed in particles.128.txt:
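A sketch of the full command, assembled from the arguments described below; the file names are this tutorial's examples, so adjust paths to your own dataset:

```shell
# Initial low-resolution pass with the default (fast) architecture.
# --uninvert-data is needed for this dataset but not for most cryo-EM data.
cryodrgn abinit particles.128.txt \
    --ctf ctf.pkl \
    --zdim 4 \
    --uninvert-data \
    -o 50S_abinit/001_defaults.128/
```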

Key arguments for the abinit command

  • particles.128.txt the input particles, in .mrcs, .star, .txt, or .cs format

  • --ctf ctf.pkl CTF parameters in a cryodrgn .pkl file

  • --zdim 4 to specify the dimension of the latent variable (i.e. each particle will be assigned a 4-dimensional vector as its latent embedding)

    • use --zdim 0 for homogeneous ab initio reconstruction

  • -o 50S_abinit/001_defaults.128/, a clean output directory (will get created if it does not already exist)

  • --uninvert-data flag to flip the data sign of the particles dataset (this flag is not needed for most cryo-EM datasets)

With one H100 GPU we were able to finish training this model in a total of 2h 36min, with the pose search epochs taking roughly forty minutes each and the SGD epochs taking a little over a minute:

We use ChimeraX to examine the volumes reconstructed at each checkpoint epoch, saved as reconstruct.<epoch>.mrc in our output folder. By default, cryoDRGN-AI creates a checkpoint after every pretraining and hierarchical pose search (HPS) epoch, as well as after every 5th SGD epoch:
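A quick way to inspect these checkpoints from the command line; this assumes the chimerax executable is on your PATH, which depends on your ChimeraX installation:

```shell
# List the checkpoint volumes saved during training, then open them in ChimeraX.
ls 50S_abinit/001_defaults.128/reconstruct.*.mrc
chimerax 50S_abinit/001_defaults.128/reconstruct.*.mrc
```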

Analyzing cryoDRGN-AI results

Like the other reconstruction commands included in cryoDRGN, the abinit command automatically runs cryodrgn analyze on the final output epoch once model training is complete. The output of these analyses can be found in our experiment output folder under analyze.30/:
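The analysis can also be re-run by hand on any saved checkpoint; a sketch, assuming epoch 25 was among the saved checkpoints:

```shell
# Re-run the analysis pipeline on a specific saved epoch (here, epoch 25),
# producing an analyze.25/ directory alongside analyze.30/.
cryodrgn analyze 50S_abinit/001_defaults.128/ 25
```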

These outputs are the same as for commands such as cryodrgn train_vae and are fully described in our main tutorial.

Using the latent-space visualization kmeans20/umap_hex.png, we can verify that our ab initio model identifies the same heterogeneity found with fixed-pose reconstruction. This plot is generated from the model outputs, along with the particles closest to the centroids found by k-means clustering of the latent space:

We can again use ChimeraX to visualize the corresponding volumes reconstructed by cryoDRGN-AI at these centroids, which are saved in the same folder:
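For example, the centroid volumes can be opened directly from the analysis folder; the vol_*.mrc naming follows cryoDRGN's usual convention, so verify it against your own output:

```shell
# Open the k-means centroid volumes from the final-epoch analysis in ChimeraX.
chimerax 50S_abinit/001_defaults.128/analyze.30/kmeans20/vol_*.mrc
```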

Restarting and extending training from a previous checkpoint

As with train_vae, if cryoDRGN-AI training was interrupted before completing, or if you would like to train a finished model for more epochs, you can resume training by passing a saved checkpoint weights file to the --load argument. By specifying a new value for --num-epochs/-n, you can also make the model train for more epochs. Here we restart at the 25th epoch and train for a total of 40 epochs, ten more than the original training run:
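A sketch of the restart command; the weights filename is assumed to follow cryoDRGN's weights.<epoch>.pkl checkpoint naming, so check your output directory for the exact name:

```shell
# Resume from the epoch-25 checkpoint and extend training to 40 total epochs.
cryodrgn abinit particles.128.txt \
    --ctf ctf.pkl \
    --zdim 4 \
    --uninvert-data \
    -o 50S_abinit/001_defaults.128/ \
    --load 50S_abinit/001_defaults.128/weights.25.pkl \
    -n 40
```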

Checkpoints will still be created as before, along with an analysis on the updated final epoch of training:

2) Retraining cryoDRGN-AI with a larger model

Having confirmed that our method can reconstruct the space of 50S volumes using the default model parameters, we now move on to a larger model to see whether it can find more heterogeneity in this dataset or produce better-reconstructed volumes:
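A sketch of the larger-model run, using the arguments recommended in the workflow above; we reuse the same particle file here for illustration, though in practice you would first apply any particle filtering, and the output directory name is illustrative:

```shell
# Larger-model run: wider decoder (--dim 1024) and more pose-search epochs,
# written to a fresh output directory.
cryodrgn abinit particles.128.txt \
    --ctf ctf.pkl \
    --zdim 4 \
    --uninvert-data \
    --dim 1024 \
    --epochs-pose-search=5 \
    -o 50S_abinit/002_large.128/
```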

This time training took a total of 13h 10min on an H100 GPU: pose search epochs took four hours each and SGD epochs took 2.6 minutes.

We can once again look at the reconstructed volumes at checkpoint epochs, as well as the z-latent-space UMAP embedding produced by our analysis tool with the final training epoch:
