CryoDRGN Conformational Landscape Analysis
applying the cryodrgn analyze_landscape commands to better understand a model's latent z-space
Work in progress. Please file a GitHub issue if you find any bugs, or start a discussion on GitHub for questions or suggestions.
The cryoDRGN framework is based around a generative model, which means that once a model is trained, a volume can be reconstructed from any point in its latent conformation space. To comprehensively explore this potentially infinite distribution of reconstructed volumes, we have developed a set of "landscape analysis" tools for quantitative analysis of a trained cryoDRGN model, including 1) assigning discrete conformational states (and providing their particle lists for refinement) and 2) visualizing continuous conformational landscapes. These tools also allow the user to focus their analysis on specific regions of interest by providing custom masks.
Landscape analysis is implemented in the commands cryodrgn analyze_landscape and cryodrgn analyze_landscape_full, available in version 1.0+ of the cryoDRGN software. The analysis pipeline is fully automated, though there are many command line arguments that can be experimented with, and we provide a Jupyter notebook for interactive visualization.
A description of the method is found in Chapter 6 of Ellen Zhong’s thesis.
Example usage:
(cryodrgn) $ cryodrgn analyze_landscape -h
By default, the script will:
Generate 1000 volumes at a box size of 128^3
Perform PCA on the volumes to map conformational coordinates. The goal is for the volume PCA coordinates to provide a more visually interpretable representation of the dataset than the VAE latent space.
Cluster the volumes and provide summary volumes and the constituent particles for each cluster.
By default, all outputs will be located in a subdirectory [workdir]/landscape.[epoch].
The expected runtime is ~20 min (using 1 Tesla V100 GPU), which is mostly spent on volume generation; rerunning the tool without volume generation (--skip-vol) should take less than 5 min, even with no GPU.
kmeans1000: 1000 generated volumes
clustering_L2_average_10: clustering of the sketched volume ensemble as summary conformational states with:
The mean and stdev volume for each cluster
The constituent particles for each cluster (.pkl), which can be converted to a .star file
pcs: 5 eigenvolume trajectories of the sketched volume ensemble
Once 1000 volumes are generated, they are clustered to summarize the conformational states of the reconstructed ensemble. This clustering approach mirrors some of the assumptions in 3D classification (i.e. that particles fall in 1 of K discrete classes). The resulting clusters can be interpreted as the main conformational states, and the tool provides the constituent particles of each cluster, which can be exported as a .star file to other tools for further refinement.
We use agglomerative clustering, a bottom-up clustering algorithm that does not impose any geometric priors on the shape or size of the clusters. Through testing on several datasets, we find that this is effective at identifying rare states. We also use a mask around the particle to reduce the effect of noise in the background of the density map. A mask may also be manually provided to focus on a specific region (see Section 3).
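As a rough illustration of what "bottom-up" means here, the sketch below runs average-linkage agglomerative clustering on a few toy 2-D points standing in for flattened volumes. It is a minimal pure-Python sketch under those assumptions, not cryoDRGN's implementation; the function names and data are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Hypothetical toy data: two groups plus one isolated point (a "rare state").
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.2), (9.0, 0.0)]
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

In this toy case the isolated point ends up as its own cluster, because no assumption is made about cluster size or shape.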
There are many hyperparameters of the clustering algorithm. As a best practice, one should experiment with the number of clusters (-M) and the agglomerative clustering linkage type (e.g. --linkage average or --linkage ward). See the subsections below for some examples of changing these parameters.
The outputs of clustering will be located in a subdirectory clustering_L2_average_10.
Each cluster is described by:
A mean volume and standard deviation volume, e.g. 0_mean.mrc, 0_std.mrc, 1_mean.mrc, 1_std.mrc, ...
A numbered subdirectory (0, 1, 2, etc.) containing the volumes in each cluster
A list of the underlying particles (as an index .pkl file) that may be converted to a .star file with cryodrgn_utils write_star
Visualization of the 1000 volumes colored by cluster label in the VAE latent space (umap.png, umap_annotated.png) and in the volume PCA space (vol_embeddings_1000*png).
Volume and particle counts for each cluster (volume_counts.png, particle_counts.png)
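The particle index files can also be inspected directly in Python, for example to pool the particles of two clusters before refinement. The sketch below assumes each .pkl file unpickles to a sequence of integer particle indices; the file names and contents are hypothetical stand-ins written to a temporary directory so that the example is self-contained. Files written with numpy arrays would additionally need numpy installed to unpickle.

```python
import os
import pickle
import tempfile


def load_indices(path):
    """Load a pickled sequence of particle indices as a list of ints."""
    with open(path, "rb") as f:
        return [int(i) for i in pickle.load(f)]


# Hypothetical stand-in files; real file names in the output directory may differ.
tmpdir = tempfile.mkdtemp()
stand_ins = {"cluster_a.pkl": [3, 17, 42], "cluster_b.pkl": [5, 8, 99]}
for name, indices in stand_ins.items():
    with open(os.path.join(tmpdir, name), "wb") as f:
        pickle.dump(indices, f)

a = load_indices(os.path.join(tmpdir, "cluster_a.pkl"))
b = load_indices(os.path.join(tmpdir, "cluster_b.pkl"))

# Pool the particles of both clusters into one sorted selection.
combined = sorted(set(a) | set(b))
print(len(a), len(b), combined)
```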
The default number of clusters is 10. If your dataset is very heterogeneous or if you want a finer-resolution clustering, you can increase the number of clusters with the -M flag. Changing M corresponds to changing the cut point in the dendrogram of agglomerative clustering. Clustering can be repeated by re-running cryodrgn analyze_landscape with the --skip-vol flag to skip volume generation.
The updated clustering results will be in a new subdirectory, clustering_L2_average_[M].
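To make the "cut point in the dendrogram" idea concrete, the sketch below replays a fixed merge history and stops early to obtain M clusters. The merge history and the function name are hypothetical; the point is only that different values of M cut the same hierarchy at different heights, so the resulting clusterings are nested.

```python
def cut_dendrogram(merges, n_items, n_clusters):
    """Replay the first (n_items - n_clusters) merges and return the clusters."""
    parent = list(range(n_items))  # simple union-find forest

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in merges[: n_items - n_clusters]:
        parent[find(j)] = find(i)  # merge the cluster of j into the cluster of i

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical merge history for 6 volumes, ordered from closest to farthest merge.
merges = [(0, 1), (3, 4), (0, 2), (0, 3), (0, 5)]

for m in (4, 3, 2):
    print(m, cut_dendrogram(merges, n_items=6, n_clusters=m))
```

Going from M=3 to M=4 only splits one existing cluster; it never reshuffles volumes between clusters.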
The linkage type affects how volumes are merged in the agglomerative clustering algorithm. The default setting is --linkage average, which we have found to be sensitive to outliers (e.g. junk/artifacts or rare states of interest). For more evenly populated clusters (e.g. discretizing a structural continuum), try --linkage ward.
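The difference between the two linkage types can be seen on a small 1-D toy example. The sketch below uses the textbook definitions (average linkage as the mean pairwise distance, Ward linkage as the increase in within-cluster sum of squares) and asks which pair of clusters each criterion would merge next. The data and function names are hypothetical, and this is an illustration of the criteria rather than cryoDRGN's code.

```python
from itertools import combinations


def average_distance(a, b):
    """Average linkage: mean pairwise distance between members of a and b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def ward_cost(a, b):
    """Ward linkage: increase in within-cluster sum of squares if a and b merge."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mean_a - mean_b) ** 2


def next_merge(clusters, criterion):
    """Return the names of the two clusters with the lowest merge criterion."""
    return min(
        combinations(sorted(clusters), 2),
        key=lambda pair: criterion(clusters[pair[0]], clusters[pair[1]]),
    )


# Hypothetical state: two halves of a continuum and a single outlier.
clusters = {
    "A": [0.0, 1.0, 2.0, 3.0],
    "B": [4.0, 5.0, 6.0, 7.0],
    "outlier": [11.0],
}
average_choice = next_merge(clusters, average_distance)
ward_choice = next_merge(clusters, ward_cost)
print("average merges:", average_choice)
print("ward merges:   ", ward_choice)
```

Here average linkage merges the two halves of the continuum and leaves the outlier on its own, while Ward linkage absorbs the outlier into its nearest cluster, because merging two large clusters is penalized more heavily.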
Rerunning landscape analysis with --linkage ward will produce a new subdirectory, clustering_L2_ward_10.
Some datasets will contain "junk" volumes that can interfere with clustering and PCA analysis. These volumes can be selected and removed with the utility cryodrgn_utils select_clusters.
Note: rerunning with --vol-ind will change the volume PCA results since volumes have now been removed from the analysis.
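The reason the PCA results change is that removing volumes changes the ensemble that PCA is centered and fit on. The sketch below illustrates the idea of filtering volumes by cluster label and the resulting shift in the ensemble mean; the labels, volumes, and variable names are hypothetical, and the sketch does not reproduce the interface of cryodrgn_utils select_clusters.

```python
def ensemble_mean(volumes):
    """Voxel-wise mean over a list of flattened volumes."""
    n = len(volumes)
    return [sum(voxels) / n for voxels in zip(*volumes)]


# Hypothetical cluster label for each generated volume, and the junk cluster(s).
labels = [0, 0, 1, 2, 1, 0]
junk_clusters = {2}

# Hypothetical flattened volumes (2 voxels each); volume 3 is the junk one.
volumes = [[1.0, 0.0], [1.0, 0.2], [0.8, 1.0], [9.0, 9.0], [0.9, 1.1], [1.1, 0.1]]

# Indices of the volumes that survive the filter.
keep = [i for i, label in enumerate(labels) if label not in junk_clusters]

mean_all = ensemble_mean(volumes)
mean_kept = ensemble_mean([volumes[i] for i in keep])
print(keep)
print(mean_all, mean_kept)
```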
The cryodrgn analyze_landscape tool applies a mask on all 1000 volumes before PCA/clustering analysis. The default mask is generated by thresholding all 1000 generated volumes at half of their max density values and then combining all masks (by their union). The mask settings may be adjusted with the --thresh and --dilate arguments in cases where the automated mask generation leaves in undesired regions (e.g. extra background) or leaves out heterogeneous regions of the particle. To check the mask, see the mask.mrc and mask_slices.png files in the output directory.
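The default mask construction described above can be sketched in a few lines. This is a minimal illustration on flattened toy volumes, assuming a strict "greater than half of the maximum" comparison; the function names and data are hypothetical, and the --thresh and --dilate adjustments are not modeled.

```python
def half_max_mask(volume):
    """Boolean mask of voxels above half of this volume's maximum density."""
    cutoff = 0.5 * max(volume)
    return [v > cutoff for v in volume]  # strict comparison is an assumption


def union(masks):
    """A voxel is in the combined mask if it is in any per-volume mask."""
    return [any(voxels) for voxels in zip(*masks)]


# Two hypothetical flattened volumes with different density scales.
volumes = [
    [0.0, 0.1, 0.9, 1.0, 0.2, 0.0],  # cutoff 0.5
    [0.0, 0.0, 0.3, 2.0, 1.6, 0.1],  # cutoff 1.0
]
mask = union(half_max_mask(v) for v in volumes)
print(mask)
```

Because the threshold is relative to each volume's own maximum and the masks are combined by union, a region that is occupied in only some of the volumes is still kept in the final mask.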
Alternatively, a custom mask can be provided with the flag --mask. Note that the mask will be converted to a binary mask, where any nonzero voxel will be included in the analysis.
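One practical consequence of this binarization is that a soft-edged mask contributes its entire soft edge to the analysis. The sketch below shows this on a flattened toy mask; the names and values are hypothetical.

```python
def binarize(mask):
    """Any nonzero voxel is included, regardless of its value."""
    return [v != 0 for v in mask]


def apply_mask(volume, binary_mask):
    """Keep only the voxels selected by the binary mask."""
    return [v for v, keep in zip(volume, binary_mask) if keep]


# Hypothetical soft-edged mask and a flattened volume of the same size.
soft_mask = [0.0, 0.05, 0.5, 1.0, 0.5, 0.0]
volume = [0.3, 0.2, 0.8, 1.1, 0.7, 0.4]

binary = binarize(soft_mask)
masked = apply_mask(volume, binary)
print(binary)
print(masked)
```

If the soft edge should be excluded, the mask would need to be thresholded before it is passed to the tool.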
Similar to Haselbach et al., we apply principal component analysis (PCA) on the set of cryoDRGN volumes to map reaction coordinates and visualize a conformational landscape that is more interpretable than the cryoDRGN latent variable representation.
The output of cryodrgn analyze_landscape will include a directory containing the top 5 principal components (.mrc trajectories showing interpolations along the "eigenvolumes"). These principal component trajectories can be interpreted as reaction coordinates for describing the full ensemble.
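The idea of an eigenvolume and a trajectory along it can be sketched on toy data. The example below computes only the first principal component, using power iteration in pure Python as a stand-in for a library PCA routine, and then interpolates along it from the mean volume. The function name and data are hypothetical, and this is not how cryoDRGN itself computes the components.

```python
def first_pc(volumes, n_iter=100):
    """Return the mean volume, first principal component, and per-volume scores."""
    n, d = len(volumes), len(volumes[0])
    mean = [sum(v[k] for v in volumes) / n for k in range(d)]
    centered = [[v[k] - mean[k] for k in range(d)] for v in volumes]

    def project(pc):
        return [sum(x[k] * pc[k] for k in range(d)) for x in centered]

    pc = [1.0] * d  # arbitrary start; must not be orthogonal to the true component
    for _ in range(n_iter):
        scores = project(pc)
        # w = X^T X pc, i.e. one power-iteration step on the (unscaled) covariance
        w = [sum(scores[i] * centered[i][k] for i in range(n)) for k in range(d)]
        norm = sum(c * c for c in w) ** 0.5  # zero only if the data has no variance
        pc = [c / norm for c in w]
    return mean, pc, project(pc)


# Hypothetical flattened volumes: voxel 0 is constant, voxels 1 and 2 vary together.
volumes = [[1.0, t, 2.0 * t] for t in (0.0, 1.0, 2.0, 3.0)]
mean, pc, scores = first_pc(volumes)

# Trajectory along the eigenvolume: mean volume plus a multiple of the component.
trajectory = [
    [m + c * p for m, p in zip(mean, pc)]
    for c in (min(scores), 0.0, max(scores))
]
print(pc)
print(scores)
print(trajectory)
```

In this toy case the constant voxel has zero weight in the component, and the two ends of the trajectory recover the two extreme volumes of the ensemble.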
The initial mapping of the volumes is performed on the set of 1000 sketched volumes. A second tool, cryodrgn analyze_landscape_full, maps all particles to the volume PCA space to visualize a conformational landscape for the full dataset. This tool will take longer to run for the default number (10,000) of training volumes (~2.5 hours on 1 Tesla V100 GPU for volume generation, 1 min for mapping).
This command also produces a Jupyter notebook, cryoDRGN_landscape_viz.ipynb, for plotting the inferred conformational landscape.