CryoDRGN EMPIAR-10076 Tutorial
A step-by-step tutorial on reconstruction of the assembling ribosome dataset using cryoDRGN
This walkthrough of cryoDRGN analysis of the assembling ribosome dataset (EMPIAR-10076) from Figure 5 of Zhong et al. covers:
preprocessing of inputs,
initial cryoDRGN training and analysis,
removing junk particles,
high-resolution cryoDRGN training and analysis,
extracting particles for traditional refinement, and
generation of trajectories.

For an abbreviated overview of the steps for running cryoDRGN, see the README on the github page: https://github.com/zhonge/cryodrgn
All pre-processed inputs that are generated in this tutorial (except particle images) are also located here: https://github.com/zhonge/cryodrgn_empiar
The final trained networks and density maps are deposited here: https://zenodo.org/record/4355284.
For compute requirements and installation instructions, see cryoDRGN installation with anaconda.
1) Obtain the dataset
Download EMPIAR-10076 (~51GB) from https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10076/
You can download it directly from the browser, or from the command line using Aspera Connect (More info here)
$ ascp -QT -l 200M -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext3@hx-fasp-1.ebi.ac.uk:/10076 .
This dataset contains 131,899 extracted particles with box size of 320 and pixel size of 1.31 A/pix.

The particle images are in the file L17Combine_weight_local.mrc. There is a typo in the file extension, so rename the .mrc file with the .mrcs extension:
$ mv L17Combine_weight_local.mrc L17Combine_weight_local.mrcs
The CTF parameters for each particle are in the metadata file Parameters.star. We will extract the CTF parameters from this file later in Step 3. To preview the file:
$ head -n 20 Parameters.star
Poses (i.e. particle alignments) are not present in the deposited data, so we will next run a consensus reconstruction.
2) Consensus reconstruction (optional)
Perform a C1 homogeneous refinement in your favorite cryo-EM software package. We will be using the poses from this "consensus reconstruction".
To skip this step, download the poses from https://github.com/zhonge/cryodrgn_empiar/blob/main/empiar10076/inputs/cryosparc_P4_J33_004_particles.cs
A detailed walkthrough of this process is beyond the scope of this guide, but we will briefly summarize the steps we used in cryoSPARC v2.4.
In cryoSPARC, 1) import the particles, 2) run an ab initio reconstruction job, then 3) run a homogeneous refinement job, all with default parameters.
For step (1), note that the data-sign of EMPIAR-10076 is flipped from most cryo-EM datasets (i.e. dark-on-light particles), so you will need to toggle the data sign when importing the particles.
Final result of refinement:




3) Preprocess inputs
In this step, we will first extract poses (Step 3.1), then extract CTF parameters (Step 3.2), then downsample our images (Step 3.3).
Pose and CTF parameters are stored in various formats depending on the upstream processing software. CryoDRGN contains scripts to convert this information to a .pkl file format.
3.1) Convert poses to cryoDRGN format
CryoDRGN has two executables for parsing pose information (i.e. particle alignments) from either a cryoSPARC .cs file or a RELION .star file:
$ cryodrgn parse_pose_star -h
$ cryodrgn parse_pose_csparc -h
Note: -D should be set to the box size of the refinement (and not the downsampled image size).
Note: the argument --Apix X.XX is required if the pixel size is not found in the .star file. The pixel size should be set to the pixel size of the refinement (corresponding to -D).
For this tutorial, we will use cryodrgn parse_pose_csparc to extract poses from the cryoSPARC refinement output cryosparc_P4_J33_004_particles.cs from Step 2. Example command and output:
$ cryodrgn parse_pose_csparc cryosparc_P4_J33_004_particles.cs -D 320 -o poses.pkl
Note: -D should be set to the box size of the refinement, which was 320 in this case.
3.2) Convert CTF parameters to cryoDRGN format
CryoDRGN has two executables for parsing CTF information from either a cryoSPARC .cs file or a RELION .star file:
$ cryodrgn parse_ctf_star -h
$ cryodrgn parse_ctf_csparc -h
We can extract the CTF parameters from the deposited Parameters.star file. Example command:
$ cryodrgn parse_ctf_star Parameters.star --Apix 1.31 -D 320 -o ctf.pkl --ps 0
The pixel size (--Apix 1.31), image size (-D 320), and phase shift (--ps 0) are provided separately since they are not present in the .star file.
This metadata is also present in the cryoSPARC refinement .cs file. Below is an example usage of cryodrgn parse_ctf_csparc to extract the CTF information from the cryosparc_P4_J33_004_particles.cs file. Example command:
$ cryodrgn parse_ctf_csparc cryosparc_P4_J33_004_particles.cs -o ctf.pkl
3.3) Downsample images
CryoDRGN training time is highly dependent on the image size (See Fig. 2E in Zhong et al.). We will downsample images to an image size of D=128 (where D is the image dimension in pixels, i.e. a 128x128 image) using cryodrgn downsample. Later on, we will train a higher-resolution model using larger images (D=256).
Because of the tradeoff between training time and the representation capacity of the neural networks, we recommend training on images with a maximum box size of D=256. In general, to optimize training time, we encourage using the smallest possible image size for the desired/expected resolution of the final density maps and their motions. Here, an image size of 256x256 corresponds to a pixel size of 1.31*320/256 = 1.6375 A/pix and a Nyquist resolution of 3.275 A, which is still finer than the resolutions of the published density maps (~4-5 A).
$ cryodrgn downsample -h
Downsample to D=256
First, downsample our input file L17Combine_weight_local.mrcs to an image size of D=256, saved as the output file particles.256.mrcs. Example command:
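A command along the following lines (the --chunk flag is explained next):
$ cryodrgn downsample L17Combine_weight_local.mrcs -D 256 -o particles.256.mrcs --chunk 50000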
Use the --chunk 50000 flag to chunk the output into separate .mrcs files containing 50k images each, to avoid out-of-memory errors when saving out a large particle stack. Now, instead of a single output file, the downsampled images will be stored in three separate .mrcs files (particles.256.0.mrcs, particles.256.1.mrcs, and particles.256.2.mrcs), the first two containing 50k images and the third containing the remaining 31,899 images, along with a text file, particles.256.txt, listing the three .mrcs files.
Note: Chunked .mrcs files listed in a .txt file can be given as either absolute or relative paths. By default the .txt file will only contain the file names (therefore it must remain in the same directory as the .mrcs files).
Downsample to D=128
Next downsample our D=256 particles to D=128. Example command:
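A sketch, starting from the chunked D=256 stack via its .txt file:
$ cryodrgn downsample particles.256.txt -D 128 -o particles.128.mrcs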
The downsampled image stack will be 8.1GB.
Note: You could also downsample to D=128 directly from the original particles, but it is faster (and equivalent) to do so from the smaller D=256 images.
Additional usage notes
cryodrgn downsample can also be used to downsample volumes using the --is-vol argument.
The input format to specify the particle stack may also be a .star file or a .cs file.
If the paths to the .mrcs particles given by the .star/.cs file are broken, you can overwrite them using the argument --datadir [PATH TO DIRECTORY WITH .MRCS]. In some cases, the --datadir path should point to the project directory in order to complete relative file paths given in the .star or .cs file.
4) CryoDRGN training
When the input image stack (.mrcs), image poses (.pkl), and CTF parameters (.pkl) have been prepared, a cryoDRGN model can be trained with cryodrgn train_vae:
$ cryodrgn train_vae -h
Many of the parameters of this script have sensible defaults. The required arguments are:
an input image stack (.mrcs or other listed file types)
--poses, image poses (.pkl) that correspond to the input images (see Step 3.1)
--ctf, CTF parameters (.pkl), unless phase-flipped images are used (see Step 3.2)
--zdim, the dimension of the latent variable
-o, a clean output directory for storing results
Additional parameters which are typically set include:
-n, number of epochs to train
Neural network architecture settings: --enc-layers, --enc-dim, --dec-layers, --dec-dim
--multigpu to enable parallelized training across multiple GPUs (fast!)
General recommended workflow
First, train on lower resolution images (e.g. D=128) using the default architecture (fast) as an initial pass to sanity check results and remove junk particles. (D=128, smaller 256x3 architecture)
After any particle filtering, train a larger model with the --enc-dim 1024 and --dec-dim 1024 arguments, which will have more parameters and can potentially learn more heterogeneity. (D=128, larger 1024x3 architecture)
Finally, after validation, pose optimization, and any necessary particle filtering, train on the full-resolution image stack (up to D=256) with a large architecture. (D=256, larger 1024x3 architecture)
In this tutorial, we will walk through the commands and analysis for Step 1 and Step 3 in the above workflow. You can run step 2 on your own to see how the results compare.
CryoDRGN initial training
Run cryodrgn train_vae on the full particle stack of D=128 particles (1 GPU; 64GB memory requirement). Example command:
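A command consistent with the inputs and arguments described below (the data/ and tutorial/ paths are the ones used throughout this tutorial):
$ cryodrgn train_vae data/128/particles.128.mrcs --ctf data/ctf.pkl --poses data/poses.pkl --zdim 8 -n 50 --uninvert-data -o tutorial/00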
Inputs:
particles (.mrcs, .star, .txt, or .cs format), here data/128/particles.128.mrcs
--ctf, CTF parameters in a cryoDRGN .pkl file, here data/ctf.pkl
--poses, poses in a cryoDRGN .pkl file, here data/poses.pkl
Arguments:
--zdim 8 to specify the dimension of the latent variable (i.e. each particle will get assigned an 8-dimensional vector as its latent embedding)
-n 50 to specify 50 epochs of training (i.e. 50 iterations through the dataset)
-o tutorial/00, a clean output directory (will be created if it does not already exist)
Use the --uninvert-data flag to flip the data sign of the particles (this flag is not needed for most cryo-EM datasets).
Outputs:
The training will take ~7.5 min/epoch on a V100 GPU, requiring ~7 hours total. The output directory will contain the following files:
config.yaml, a configuration file containing all inputs and settings for the job
weights.pkl, the final neural network weights (and intermediate checkpoints weights.n.pkl)
z.pkl, the final per-particle latent embeddings (and embeddings from intermediate epochs z.n.pkl)
Note: you can use cryodrgn analyze to analyze intermediate epochs during training.
Extending or restarting from a checkpoint
If the run was interrupted before it reached the desired number of epochs or you would like to train longer, a training job can be extended with the --load argument. For example, to resume training that was interrupted at epoch 24:
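For example (a sketch reusing the training command from above, with --load pointing at the saved epoch-24 checkpoint):
$ cryodrgn train_vae data/128/particles.128.mrcs --ctf data/ctf.pkl --poses data/poses.pkl --zdim 8 -n 50 --uninvert-data -o tutorial/00 --load tutorial/00/weights.24.pkl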
You can also use --load latest, and cryodrgn will try to detect the latest set of model weights in the specified output directory.
5) cryoDRGN analysis
Once the model has finished training, the cryodrgn analyze command is automatically run on the final training epoch to visualize the latent space, generate density maps, and generate template Jupyter notebooks for further interactive filtering, visualization, and analysis. The analyze command can subsequently be run on any epoch with saved outputs, e.g.:
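For example, to analyze the epoch-24 checkpoint of the initial training run (directory and epoch are illustrative):
$ cryodrgn analyze tutorial/00 24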
This command runs a series of standard analyses, described below:
PCA and UMAP visualizations of the particle latent embeddings
Generation of volumes from the latent embeddings at kmeans cluster centers
Generation of trajectories along the first and second principal components of the latent embeddings
Generation of a template jupyter notebook that may be used for further interactive analyses, visualization, and volume generation
[New in version 0.3.2] Generation of a template jupyter notebook that may be used for particle filtering
[New in version 2.2.0] Generation of a template jupyter notebook for adjusting any of the output figures
What's in the analysis directory?
Once cryodrgn analyze finishes, there will be a new subdirectory, analyze.50/, within the job directory:
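A representative layout (exact contents vary by cryoDRGN version):
analyze.50/
  kmeans20/                  # 20 sampled density maps with cluster annotations
  pc1/, pc2/                 # volumes along the first two principal components
  umap.png, z_pca.png        # visualizations of the latent embeddings
  cryoDRGN_viz.ipynb         # interactive visualization notebook
  cryoDRGN_filtering.ipynb   # particle filtering notebook
  cryoDRGN_figures.ipynb     # figure-adjustment notebook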
Visualization of the latent space
Because we trained an 8-dimensional latent variable model (--zdim 8), each particle image is associated with an 8-D vector embedding (saved in z.50.pkl). cryodrgn analyze runs PCA (a linear dimensionality reduction technique) and UMAP (a nonlinear dimensionality reduction technique) to visualize these latent embeddings. These images will be found in the analyze.50 directory:




We can see there are 5 major clusters in the latent space similar to Figure 5B in Zhong et al. We will show how these align with the 4 major classes of the LSU ribosome + a fifth class of junk particles, and how to remove the fifth junk class from the particle stack.
Sampled density maps
The analyze.50 directory will also contain a subdirectory kmeans20/ with representative reconstructed density maps:
cryodrgn analyze uses the k-means clustering algorithm to partition the latent space into k regions (by default k=20). A density map is generated from the center of each of these clusters. The kmeans20 volumes provide an initial set of representative density maps to visually inspect (and not necessarily to assign classes).
The directory will also contain PCA and UMAP visualizations with annotations of where the 20 density maps were generated.




PC trajectories
There are two subdirectories in the analysis directory that will contain volumes generated along the first and second principal components (PC) of the latent space. By default, the volumes are generated at equally spaced points between the first and 99th percentile of the data distribution projected onto each principal component (e.g. Fig 6c of Zhong et al).
Note 1: The PC trajectories can give a sense of the variability in the structure, however, the principal components of cryoDRGN's latent space are not principal components of the volumes due to the nonlinear nature of the decoder.
Note 2: Principal component trajectories can also produce nonphysical "motions", e.g. if there is discrete variability in the structure.
Additional usage notes:
Use the flag --pc N to change the number of PCs that are generated by cryodrgn analyze.
For more flexibility in the exact start and end points and number of volumes in the PC trajectories, see:
$ cryodrgn pc_traversal -h


The umap.png plot within each pcX subdirectory shows the UMAP embedding of each particle, colored by its PCX value. This helps give a sense of the layout of the latent space and how the UMAP embedding is related to the PCA projection.
6) Particle filtering with the cryoDRGN Jupyter notebook
This section will walk through how to use the cryoDRGN Jupyter notebook, cryoDRGN_filtering.ipynb, to filter junk particles out of the dataset. The Jupyter notebook provides an interactive environment for running Python code blocks to analyze the cryoDRGN results.
6.1) Accessing the jupyter notebook
Jupyter is a web application that requires an internet browser to use. If you have browser access on the machine where you ran the cryodrgn experiment, you can start the jupyter notebook from the command line with the command jupyter notebook:
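For example, from the cryoDRGN output directory:
$ cd tutorial/00
$ jupyter notebook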
If the cryodrgn training was performed on a remote machine (e.g. on a compute cluster), you can still view the notebook on your local machine (e.g. laptop) by setting up SSH port forwarding. For more information, see: https://docs.anaconda.com/anaconda/user-guide/tasks/remote-jupyter-notebook/.
The port number can be set to an arbitrary number between 1024 through 49151 (as long as the port is unused). Jupyter should detect if a port is already in use and will automatically use the next (N+1) port number.
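For example, a typical port-forwarding invocation from your local machine (the username, hostname, and port are placeholders):
$ ssh -N -L 8888:localhost:8888 user@remote-host
Then navigate to localhost:8888 in your local browser.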
Here is a screenshot of the page that will show up in your browser. It should list all the files in that directory:

Then click cryoDRGN_filtering.ipynb to start the jupyter notebook:

6.2) Run the jupyter-notebook for particle filtering
The jupyter notebook has template code for performing all the analysis and plotting and should require minimal intervention. For a general overview of running jupyter notebooks, see their documentation here: https://jupyter.readthedocs.io/en/latest/running.html.
The cryoDRGN_filtering.ipynb jupyter notebook has three methods for filtering particles:
clustering of the latent embeddings (k-means or Gaussian mixture model)
outlier detection (Z-score)
interactive selection with a lasso tool
For each method, the selected particles are tracked in the variable ind_selected.
Once the selection has been finalized, the selected particle indices are saved into the cryoDRGN workdir as an index .pkl file at the end of the notebook. The selected indices can then be provided to cryoDRGN with the --ind argument to train a new model on a subset of the images, or converted to .star file format.
For this tutorial, we will use the first method (GMM clustering) to follow the methodology used in Zhong et al. (See Extended Data Fig. 5). However, we will also demo the other methods, which you are welcome to explore.
Then run the cells of the notebook to load the relevant results, up to the first "Filtering by cluster" section.
Published filtering results
This section shows the tutorial results colored by the filtered particles in the original publication of the dataset (Davis et al) and in the cryoDRGN paper (Zhong et al). In prospective analysis, junk is usually identified by visualizing the reconstructed volumes and/or the particle images, which is especially straightforward for edge/ice artifacts. The filtering may be validated by 2D classification or other traditional tools. In this dataset, the junk particles also have a very different pose distribution (not shown, but you can view this in the interactive visualization section of the jupyter notebook).
3D classification labels
(Optional) Add these code blocks in the jupyter notebook if you wish to generate these visualizations.




In the next section, we will show the steps to remove the center cluster in the UMAP corresponding to the junk particles (the brown cluster in the 3D classification labels).
6.3) Filtering by GMM cluster label
In this section, we demo the steps and outputs for filtering out the junk cluster using a Gaussian mixture model (GMM) clustering algorithm.
The "Filtering by cluster" section of the Jupyter notebook has two subsections — filtering by kmeans clustering and by GMM clustering. You can try both options and play around with the cluster # and the random seed.
In the GMM-clustering section, change the cluster number (G=6 here) and rerun as necessary to get the desired clustering. Because GMM is a non-deterministic algorithm, the results are sensitive to the initial random seed. You can rerun the cells to try a different random seed.
Note that the clustering is performed on the latent embedding (z) and not on the umap embeddings (umap). This can be experimented with as well.
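As a rough illustration of what the notebook's GMM cells do, here is a minimal sketch (assuming the z.50.pkl embeddings from the initial run; the notebook's own cells differ in their details):
import pickle
import numpy as np
from sklearn.mixture import GaussianMixture

# Load the final per-particle latent embeddings, shape (n_particles, 8)
with open("tutorial/00/z.50.pkl", "rb") as f:
    z = pickle.load(f)

gmm = GaussianMixture(n_components=6)  # G=6 clusters; rerun for a new random seed
labels = gmm.fit_predict(z)            # cluster on z, not the UMAP coordinates

# Mark clusters 2 and 5 (the junk clusters in this walkthrough) for removal
ind_selected = np.where(np.isin(labels, [2, 5]))[0]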


The resulting 6 clusters:



Set clusters 2 and 5 (green and brown) as the bad/junk clusters. The code block will then identify the particles corresponding to those clusters.

Select particles from clusters 2 and 5.


Alternative method: Filtering by z-score
In practice, we often find that junk particles are located as outliers in the distribution of latent embeddings. This section selects outlier particles by the magnitude of the embedding.
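A minimal sketch of this approach (the 2-sigma cutoff is an illustrative assumption; the notebook parameterizes this differently):
import pickle
import numpy as np

with open("tutorial/00/z.50.pkl", "rb") as f:
    z = pickle.load(f)

# Distance of each particle's embedding from the mean embedding
zdist = np.linalg.norm(z - z.mean(axis=0), axis=1)
cutoff = zdist.mean() + 2 * zdist.std()     # assumed 2-sigma outlier cutoff
ind_selected = np.where(zdist > cutoff)[0]  # outlier particles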

These plots, which help visualize which particles are selected, will be generated automatically:



Alternative method: Filtering with an interactive lasso tool
Sometimes, the best clustering algorithm is the human eye (e.g. kmeans/GMM clustering algorithms can fail to detect small/oddly-shaped clusters). The cryoDRGN_filtering.ipynb notebook also provides an interactive widget that lets you interactively select regions of the latent space using a lasso tool.
When the widget first shows up, you will need to change the x-axis and y-axis labels to plot the particles by their UMAP embeddings. You can also change the plotting colors.
(Note: Sometimes the tool struggles when there are >500k particles. This is a known issue.)
(Note: There have been installation issues with jupyter widgets on some linux systems. See [link to github].)



View the raw particles
The second-to-last section of the jupyter notebook contains code blocks for viewing the raw particle images. This section will visualize 9 images at random from the selected particles. This is sometimes useful for verifying that the selected particles are problematic, e.g. in obvious cases of junk particles.



Saving the selection
The last section of the notebook has cells that will help save the selected indices to use in downstream processing.

In the filtering sections, the selected particles are tracked in the variables ind_selected and ind_selected_not. Because we selected the bad particles, we will now swap which selection the ind_keep and ind_bad variables are set to in the first cell. This is purely for organizational/file-naming purposes.

Visualization of our kept particles, ind_keep:


The last three cells will print out a summary of the selection and where the indices were saved:

After running these cells, there should be two files saved in the output directory of the cryodrgn job (the number of particles should be similar but most likely not identical):
ind_keep.96509_particles.pkl
ind_bad.35390_particles.pkl
These .pkl files contain a numpy array of indices into the particle stack, which can be provided to cryodrgn train_vae with the --ind flag to restrict training to the desired subset of particles. This avoids having to extract a new particle stack.
(Additional Functionality) Writing a new .star file
The selected particles can be converted to a .star file using cryodrgn_utils write_star. See Section 8.3 for more details on this tool.
(Additional Functionality) Extracting a new particle stack
A new particle stack can be extracted using cryodrgn_utils filter_mrcs.
7) CryoDRGN high-resolution training
Now that we have identified the junk particles, we will rerun cryodrgn train_vae on larger, higher-resolution images (D=256) using a larger neural network model (1024 dim x 3 layer architecture).
Example command to run on 1 GPU
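A sketch consistent with the settings described in this section (the index file comes from Section 6; directory names are illustrative):
$ cryodrgn train_vae data/256/particles.256.txt --ctf data/ctf.pkl --poses data/poses.pkl --ind tutorial/00/ind_keep.96509_particles.pkl --uninvert-data --zdim 8 -n 50 --enc-dim 1024 --dec-dim 1024 -o tutorial/01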
By providing the --ind argument, cryoDRGN will filter the list of particles, ctf parameters, and poses by the provided index array.
This is the most compute-intensive step of the tutorial. The training will take ~25 min/epoch on a single V100 GPU, requiring ~21 hours total for 50 epochs.
Accelerated training with GPU parallelization
Use cryoDRGN's --multigpu flag to enable parallelized training across all detected GPUs on the machine. To select specific GPUs for cryoDRGN to run on, use the environment variable CUDA_VISIBLE_DEVICES, e.g.:
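For example, to train on GPUs 0 and 1 only (a sketch reusing the command above):
$ CUDA_VISIBLE_DEVICES=0,1 cryodrgn train_vae data/256/particles.256.txt --ctf data/ctf.pkl --poses data/poses.pkl --ind tutorial/00/ind_keep.96509_particles.pkl --uninvert-data --zdim 8 -n 50 --enc-dim 1024 --dec-dim 1024 --multigpu -o tutorial/01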
When training is parallelized across multiple GPUs, the batch size (number of images trained in each mini-batch of SGD; default -b 8) will be automatically scaled by the number of available GPUs to better take advantage of parallelization. Depending on your compute resources, GPU utilization may be improved with -b 16 (i.e. to achieve linear scaling of runtime with # GPUs). However, note that GPU parallelization, while leading to a faster wall-clock time per epoch, may require increasing the total number of epochs, since the training dynamics are affected (fewer model updates per epoch with larger -b).
Accelerated training with mixed-precision mode
Mixed precision training is available for Nvidia GPUs with tensor core architectures and can lead to order of magnitude speedups in training. As of cryoDRGN version 1.1, mixed precision training is now automatically enabled (and can be turned off with the --no-amp flag).
Note: We recommend using --multigpu for larger architectures (1024x3) or larger images (D=256). For smaller models/images, GPU computation may not be the training bottleneck. In this case, GPU parallelization and mixed-precision training may have a limited effect on the wall clock training time, while taking up additional compute resources.
8) Analysis
We will first walk through the default outputs from cryodrgn analyze (Section 8.1), then we will use the cryoDRGN jupyter notebook to visualize and generate assembly state density maps (Section 8.2), extract the particles for the newly identified state C4 (Section 8.3), and use cryoDRGN's graph traversal algorithm to generate trajectories of a ribosome assembly pathway (Section 8.4).
8.1) cryodrgn analyze
Similar to Step 5 above, we first run the default analysis pipeline with cryodrgn analyze:
$ cryodrgn analyze -h
Example command: (1 GPU, ~10 min)
We flip the handedness of the output volumes with the flag --flip and set the pixel size in the .mrc header with --Apix:
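For example (epoch 49 is the final epoch of the 50-epoch run; 1.6375 A/pix is the pixel size of the D=256 images):
$ cryodrgn analyze tutorial/01 49 --flip --Apix 1.6375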
As before (Section 5), this will create a new directory analyze.49 in the workdir of the cryodrgn job containing UMAP/PCA visualizations of the latent space, 20 sampled density maps at kmeans cluster centers, 10 density maps along PC1 and PC2, and jupyter notebooks for interactive analysis.
Additional usage notes:
The output volumes will be 256^3. To generate downsampled volumes (e.g. to take up less disk space), use the argument -d 128.
More volumes can be generated with --ksample N.
More PCs (up to the dimension of the latent variable) can be generated with --pc N.
The pixel size and handedness of a .mrc file can also be modified later with cryodrgn_utils add_psize and cryodrgn_utils flip_hand.
Visualization of the latent space
Because we trained an 8-dimensional latent variable model (--zdim 8), PCA (a linear dimensionality reduction technique) and UMAP (a nonlinear dimensionality reduction technique) are used to visualize the distribution of particle latent embeddings.
PCA scatter plot (left) and density plot (right)


UMAP scatter plot (left) and density plot (right)


The layout of the latent space reproduces the original results shown in Fig 5F of Zhong et al, up to a trivial inversion of the UMAP2 axis.
Default sampled density maps
cryodrgn analyze uses the k-means clustering algorithm to partition the latent space into regions (by default k=20 regions), and generate a density map from the center of each of these regions. The goal is to provide a tractable number of representative density maps to visually inspect.
Additional usage notes:
The number of partitions can be changed with the --ksample argument, e.g. --ksample 50 to generate 50 density maps.

PC trajectories
There are two subdirectories in the analysis directory that will contain volumes generated along the first and second principal components (PC) of the latent space. By default, the volumes are generated at equally spaced points between the first and 99th percentile of the distribution projected onto each principal component (e.g. Fig 6c of Zhong et al).
Note 1: The PC trajectories can give a sense of the variability in the structure, however, the principal components of cryoDRGN's latent space are not principal components of the volumes due to the nonlinear nature of the decoder.
Note 2: Principal component trajectories can also produce nonphysical "motions", e.g. if there is discrete variability in the structure.
Additional usage notes:
Use --pc to change the number of PCs that are generated by cryodrgn analyze.
For more flexibility in usage, see:
$ cryodrgn pc_traversal -h
PC1 volumes

PC2 volumes

The umap.png plots within the pcX subdirectories show the UMAP embedding colored by each particle's projected value along PCX. This helps give a sense of the layout of the latent space and how the UMAP embedding (a nonlinear 8D → 2D embedding) is related to the PCA projection (a linear projection from 8D → 2D).
8.2) The cryoDRGN jupyter notebook
CryoDRGN provides a jupyter notebook, cryoDRGN_viz.ipynb, which contains more visualization tools to facilitate interactive exploration of the trained model.
Starting the jupyter notebook
See Section 6.1.
Running the jupyter notebook
Make sure the epoch number specified in the third cell is set to the correct epoch that we are analyzing. Since we have trained for 50 epochs, it should be 49 here (0-based indexing).
Then, you can run each cell one by one to see the outputs. Alternatively, you can run the entire notebook in one go with Cell → Run All.
Additional plots in the Jupyter notebook include:
Plot of the learning curve (i.e. average loss per epoch)

Plot of the pose distribution



In the "Interactive visualization" section, any per-particle cryoDRGN data are loaded into an interactive widget for visualization.
Advanced usage: Additional data series may be added to the pandas dataframe and will automatically show up in the widget.

Select the desired x- and y-axis data series.

As a debugging tool, it is sometimes useful to visualize the latent space colored by pose or CTF parameters. Regions of the latent space that are enriched in particular viewing directions or defocus values may be a sign that the latent variable is modeling these imaging variations instead of structural heterogeneity.


Comparing to 3D classification labels
(Optional) Add these code blocks in the jupyter notebook after the data loading section to plot by 3D classification labels for major states and minor assembly states.
Plot major classes
Plot minor classes


Generating additional volumes
To generate the volumes associated with each of these assembly states, we can use the last section of the jupyter notebook:
For the variable vol_ind, replace the empty list with the indices corresponding to the assembly state cluster centers:

The next cell automatically creates a clean output directory, but you can also define your own, e.g. outdir = "major_states". Then run the final cell to generate volumes at the values of z at the indices defined with vol_ind. The volumes will be saved in the specified directory. The cell will also return links to directly download the files into your Downloads folder, which is helpful if accessing the jupyter notebook via ssh.

8.3) Extracting particles for traditional refinement
In Zhong et al, we identified a new assembly state, C4, which was validated with traditional refinement (see Figure 5F). This section shows how to use the interactive lasso tool to select particles from a region of the latent space and save the particles in .star file format.
Interactive selection
Use the interactive lasso tool to select the small cluster of particles near the class C megacluster. The table below the plot will update with the indices of the selected datapoints.


Then, evaluate the next cell to save the selected particles into the variable ind_selected. In this example, 1146 particles were selected.
Note: A representative C4 structure from cryoDRGN can be generated with the Volume Generation section of the notebook. Give it a try!
Evaluate the next few cells to view the selection in PCA and UMAP space:

In the next section of the jupyter notebook, there is template code for saving the index selection. Rename the path as ind_selected_classC4.pkl and evaluate the cells to save the selection as an index .pkl file. We will convert it to a .star file in the next step.

Note: An index selection saved as a .pkl file can also be generated from clustering tools in the cryoDRGN_filtering.ipynb jupyter notebook. See Section 6.3.
Converting to a .star file
The cryodrgn_utils write_star tool can be used to convert the index .pkl of selected particles (together with the input particle stack, and optionally the CTF and/or pose .pkl files) to .star file format:
$ cryodrgn_utils write_star -h
Note: This command will add ctf/pose information from the corresponding .pkl files, if provided. Any additional fields in the input .star file will be preserved as-is.
Example command:
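A sketch (paths are illustrative; note that the index .pkl must index into the particle stack that is passed):
$ cryodrgn_utils write_star data/128/particles.128.mrcs --ctf data/ctf.pkl --poses data/poses.pkl --ind ind_selected_classC4.pkl -o particles_classC4.star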
Filtering an existing .star file
Alternatively, to filter a .star file while keeping all the original metadata, use cryodrgn filter_star:
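A sketch, assuming the indices refer to the particle order of the original deposited .star file:
$ cryodrgn filter_star Parameters.star --ind ind_selected_classC4.pkl -o particles_classC4.star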
8.4) Generating trajectories
cryodrgn eval_vol can be used to generate additional density maps given a list of z values. (See the Github README for more complete documentation).
In this section, we will use cryoDRGN's graph traversal algorithm to find paths through latent space (cryodrgn graph_traversal), and then generate volumes along this path (cryodrgn eval_vol).
cryodrgn graph_traversal
$ cryodrgn graph_traversal -h
CryoDRGN's graph traversal algorithm builds a nearest-neighbor graph between all the latent embeddings, and then uses Dijkstra's algorithm to find the shortest path on the graph between the anchor nodes. The idea is to find a path between the anchor points in latent space while remaining on the data manifold, since we don't want to generate structures from empty regions of the latent space.
Example usage, where the kmeans cluster centers are used as the anchor points:
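A sketch using the epoch-49 embeddings, with the kmeans cluster-center particle indices written out by cryodrgn analyze (file names may vary by version):
$ cryodrgn graph_traversal tutorial/01/z.49.pkl --anchors $(cat tutorial/01/analyze.49/kmeans20/centers_ind.txt) -o path.txt --out-z z.path.txt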
Note that you could specify arbitrary data points as the anchor points, e.g. you could manually find the indices of desired data points from the jupyter notebook (in the interactive visualization sections) that you want as the endpoints of the trajectory.
cryodrgn graph_traversal will produce a text file containing a list of z values (z.path.txt). This may be converted to .mrc files with cryodrgn eval_vol, e.g.
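For example (weights and config from the high-resolution run, with --Apix and --flip as in Section 8.1):
$ cryodrgn eval_vol tutorial/01/weights.49.pkl -c tutorial/01/config.yaml --zfile z.path.txt -o graph_traversal_volumes --Apix 1.6375 --flip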
Generating assembly trajectories
In section 8.2 above, we identified the representative particles corresponding to each minor assembly state:
To generate a trajectory for assembly pathway B→D1→D2→D3→D4→E3→E5 in Figure 7 of Davis et al, use the corresponding assembly state indices as the anchor points in cryodrgn graph_traversal:
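A sketch, where IND_B ... IND_E5 are placeholders for the representative particle indices of each assembly state identified in Section 8.2:
$ cryodrgn graph_traversal tutorial/01/z.49.pkl --anchors IND_B IND_D1 IND_D2 IND_D3 IND_D4 IND_E3 IND_E5 -o pathB-E5/path.txt --out-z pathB-E5/z.path.txt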
This will produce a path.txt file and a z.path.txt file containing, respectively, the indices of the path and the values of the latent variable along the path. Then use cryodrgn eval_vol to generate volumes from the latent-space graph traversal:
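For example:
$ cryodrgn eval_vol tutorial/01/weights.49.pkl -c tutorial/01/config.yaml --zfile pathB-E5/z.path.txt -o pathB-E5/volumes --Apix 1.6375 --flip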