CryoDRGN Conformational Landscape Analysis

applying the cryodrgn analyze_landscape commands to better understand a model's latent z-space

Work in progress. Please file a github issue if there are any bugs or start a discussion on github for any questions or suggestions.

The cryoDRGN framework is based around a generative model, which means that once a model is trained, a volume can be reconstructed from any point in its latent conformation space. To comprehensively explore this potentially infinite distribution of reconstructed volumes we have developed a set of "landscape analysis" tools for quantitative analysis of a trained cryoDRGN model, including 1) assigning discrete conformational states (and providing their particle lists for refinement) and 2) visualizing continuous conformational landscapes. These tools also allow the user to focus their analysis on specific regions of interest by providing custom masks.

Landscape analysis is implemented in the commands cryodrgn analyze_landscape and cryodrgn analyze_landscape_full available in version 1.0+ of the cryoDRGN software. The analysis pipeline is fully automated, though there are many command line arguments that can be experimented with, and we provide a Jupyter notebook for interactive visualization.

A description of the method is found in Chapter 6 of Ellen Zhong’s thesis.

1. Quickstart

Example usage:

(cryodrgn) $ cryodrgn analyze_landscape [workdir] [epoch]

# for example:
(cryodrgn) $ cryodrgn analyze_landscape /path/to/cryodrgn/output/directory 24 # assuming 25 epochs of training (0-indexed)

# Use the flag -h to see all settings and their defaults:
(cryodrgn) $ cryodrgn analyze_landscape -h

(cryodrgn) $ cryodrgn analyze_landscape -h

(cryodrgn) $ cryodrgn analyze_landscape -h
usage: analyze_landscape.py [-h] [--device DEVICE] [-o OUTDIR] [--skip-vol]
                            [--skip-umap] [--vol-ind VOL_IND]
                            [--linkage LINKAGE] [-M M] [--pc-dim PC_DIM]
                            [--plot-dim PLOT_DIM] [--Apix APIX] [--flip]
                            [-d DOWNSAMPLE] [--ksample KSAMPLE]
                            [--thresh THRESH] [--dilate DILATE]
                            workdir epoch

Pipeline to analyze cryoDRGN volume distribution

positional arguments:
workdir               Directory with cryoDRGN results
epoch                 Epoch number N to analyze (0-based indexing, corresponding to z.N.pkl, weights.N.pkl)

optional arguments:
-h, --help            show this help message and exit
--device DEVICE       Optionally specify CUDA device
-o OUTDIR, --outdir OUTDIR
                      Output directory for landscape analysis results (default: [workdir]/landscape.[epoch])
--skip-umap           Skip running UMAP
--vol-ind VOL_IND     Index .pkl for filtering volumes

Extra arguments for volume generation:
-N SKETCH_SIZE, --sketch-size SKETCH_SIZE
                      Number of volumes to generate for analysis (default: 1000)
--Apix APIX           Pixel size to add to .mrc header (default: 1 A/pix)
--flip                Flip handedness of output volume
-d DOWNSAMPLE, --downsample DOWNSAMPLE
                      Downsample volumes to this box size (pixels) (default: 128)
--skip-vol            Skip generation of volumes
--vol-start-index VOL_START_INDEX
                      Default value of start index for volume generation (default: 0)

Extra arguments for mask generation:
--thresh THRESH       Density value to threshold for masking (default: half of max density value)
--dilate DILATE       Dilate initial mask by this amount (default: 5 pixels)
--mask MRC            Path to a custom mask. Must be same box size as generated volumes.

Extra arguments for clustering:
--linkage LINKAGE     Linkage for agglomerative clustering (e.g. average, ward) (default: average)
-M M                  Number of clusters (default: 10)

Extra arguments for landscape visualization:
--pc-dim PC_DIM       PCA dimensionality reduction (default: 20)
--plot-dim PLOT_DIM   Number of dimensions to plot (default: 5)

By default, the script will:

Generate 1000 volumes at a box size of 128^3
Perform PCA on the volumes to map conformational coordinates. The goal is for the volume PCA coordinates to provide a more visually interpretable representation of the dataset than the VAE latent space.
Cluster the volumes and provide summary volumes and the constituent particles for each cluster.

By default, all outputs will be located in a subdirectory [workdir]/landscape.[epoch].

The expected runtime is ~20 min (using 1 Tesla V100 GPU) which is mostly spent on volume generation; rerunning the tool without volume generation (--skip-vol) should take less than 5 min, even with no GPU.

Outputs at a glance

kmeans1000: 1000 generated volumes
clustering_L2_average_10: clustering of the sketched volume ensemble as summary conformational states with:
- The mean and stdev volume for each cluster
- The constituent particles for each cluster (.pkl), which can be converted to a .star file
pcs: 5 eigenvolume trajectories of the sketched volume ensemble

2. Assigning 3D conformational states ("classes")

Once 1000 volumes are generated, they are clustered to summarize the conformational states of the reconstructed ensemble. This clustering approach mirrors some of the assumptions in 3D classification (i.e. that particles fall in 1 of K discrete classes). The resulting clusters can be interpreted as the main conformational states, and this tool provided the constituent particles as a .star file can be exported to other tools for further refinement.

We use agglomerative clustering, a bottom-up clustering algorithm that does not impose any geometric priors on the shape or size of the clusters. Through testing on several datasets, we find that this is effective at identifying rare states. We also use a mask around the particle to reduce the effect of noise in the background of the density map. A mask may also be manually provided to focus on a specific region (see Section 3).

There are many hyperparameters of the clustering algorithm. As a best practice, one should experiment with the number of clusters (-M) and the agglomerative clustering affinity type (e.g. --linkage average or --linkage ward). See the below subsections for some examples of changing these parameters.

Outputs

The outputs of clustering will be located in a subdirectory clustering_L2_average_10

Each cluster is described by:
- A mean volume and standard deviation volume, e.g. 0_mean.mrc, 0_std.mrc, 1_mean.mrc, 1_std.mrc, ...
- A numbered subdirectory (0, 1, 2, etc.) containing the volumes in each cluster
- A list of the underlying particles (as an index .pkl file) that may be converted to a .star file with cryodrgn_utils write_star
Visualization of the 1000 volumes colored by cluster label in the VAE latent space (umap.png, umap_annotated.png) and in the volume PCA space (vol_embeddings_1000*png).
Volume and particle counts for each cluster (volume_counts.png, particle_counts.png)

(cryodrgn) $ ls clustering_L2_average_10/
0                   2_std.mrc           5_particle_ind.pkl  8_mean.mrc           vol_embeddings_1000_annotated_1_2.png
0_mean.mrc          3                   5_std.mrc           8_particle_ind.pkl   vol_embeddings_1000_annotated_2_3.png
0_particle_ind.pkl  3_mean.mrc          6                   8_std.mrc            vol_embeddings_1000_annotated_3_4.png
0_std.mrc           3_particle_ind.pkl  6_mean.mrc          9                    vol_embeddings_1000_annotated_4_5.png
1                   3_std.mrc           6_particle_ind.pkl  9_mean.mrc           vol_embeddings_1000_clusters_1_2.png
1_mean.mrc          4                   6_std.mrc           9_particle_ind.pkl   vol_embeddings_1000_clusters_2_3.png
1_particle_ind.pkl  4_mean.mrc          7                   9_std.mrc            vol_embeddings_1000_clusters_3_4.png
1_std.mrc           4_particle_ind.pkl  7_mean.mrc          state_labels.pkl     vol_embeddings_1000_clusters_4_5.png
2                   4_std.mrc           7_particle_ind.pkl  particle_counts.png  volume_counts.png
2_mean.mrc          5                   7_std.mrc           umap_annotated.png
2_particle_ind.pkl  5_mean.mrc          8                   umap.png

Optional: Change the number of clusters with -M:

The default number of clusters is 10. If your dataset is very heterogeneous or if you want a finer resolution clustering, you can increase the number of clusters with the -M flag. Changing M corresponds to changing the cut point in the dendrogram of agglomerative clustering. Clustering can be repeated by re-running cryodrgn analyze_landscape and using the --skip-vol flag to skip volume generation, for example:

(cryodgn) $ cryodrgn analyze_landscape [workdir] [epoch] --skip-vol -M 20 # use --skip-vol to skip regenerating 1000 volumes

# Results will be found in a new subdirectory, clustering_L2_average_[M]

The updated clustering results will be a new subdirectory, clustering_L2_average_[M].

Optional: Change agglomerative clustering linkage with —linkage:

The linkage type affects how volumes are merged in the agglomerative clustering algorithm. The default setting is --linkage average, which, we have found to be sensitive to outliers (e.g. junk/artifacts or rare states of interest). For more evenly populated clusters (e.g. discretizing a structural continuum), try --linkage ward.

(cryodgn) $ cryodrgn analyze_landscape [workdir] [epoch] --skip-vol --linkage ward # use --skip-vol to skip regenerating 1000 volumes

Rerunning landscape analysis with --linkage ward will produce a new subdirectory, clustering_L2_ward_10.

Optional: Remove junk clusters/volumes

Some datasets will contain "junk" volumes that can interfere with clustering and PCA analysis. These volumes can be selected and removed with the utility cryodrgn_utils select_clusters:

# e.g. keep the volumes from clusters 0, 1, 2, 4, 9
(cryodrgn) $ cryodrgn_utils select_clusters clustering_L2_average_10/state_labels.pkl --sel 0 1 2 4 9 -o vol_keep.pkl

# rerun cryodrgn analyze_landscape with --vol-ind
(cryodrgn) $ cryodrgn analyze_landscape [workdir] [epoch] --skip-vol --vol-ind vol_keep.pkl

Note: rerunning with --vol-ind will change the volume PCA results since volumes have now been removed from the analysis.

3. Masking (optional)

The cryodrgn analyze_landscape tool applies a mask on all 1000 volumes before PCA/clustering analysis. The default mask is generated by thresholding all 1000 generated volumes at half of their max density values and then combining all masks (by their union). The mask settings may be adjusted with the --thresh and --dilate arguments in cases where the automated mask generation leaves in undesired regions (e.g. extra background) or leaves out heterogeneous regions of the particle. To check the mask, see the mask.mrc and mask_slices.png file in the output directory.

Alternatively, a custom mask can be provided with the flag --mask. Note, the mask will be converted to a binary mask, where any nonzero voxel will be included in the analysis:

# rerun cryodrgn analyze_landscape with custom mask
(cryodrgn) $ cryodrgn analyze_lanscape [workdir] [epoch] --skip-vol --mask /path/to/your/mask/eg/job008.mrc

4. Visualizing conformational landscapes

Similar to Haselbach et al., we apply principal component analysis (PCA) on the set of cryoDRGN volumes to map reaction coordinates and visualize a conformational landscape that is more interpretable than the cryoDRGN latent variable representation.

Output

The output of cryodrgn analyze_landscape will include a directory containing the top 5 principal components (.mrc trajectories showing interpolations along the "eigenvolumes"). These principal component trajectories can be interpreted as reaction coordinates for describing the full ensemble, which can provide a more interpretable visualization of the dataset than the latent variable representation.

5. Visualizing the full conformational landscape

The initial mapping of the volumes is performed on the set of 1000 sketched volumes. A second tool, cryodrgn analyze_landscape_full, maps all particles to the volume PCA space to visualize a conformational landscape for the full dataset. This tool will take longer to run for the default (10,000) number of training volumes (~2.5 hours on 1 Tesla V100 GPU for volume generation, 1 min for mapping):

(cryodrgn) $ cryodrgn analyze_landscape_full [workdir] [epoch] > landscape_full.log

This command also produces a Jupyter notebook, cryoDRGN_landscape_viz.ipynb, for plotting the inferred conformational landscape.

PreviousCryoDRGN-ET Subtomogram Analysis NextMaking long trajectories with cryoDRGN graph_traversal

Last updated 8 months ago