Handling large datasets with cryoDRGN preprocess
Making large datasets fit into memory for model training
The cryodrgn preprocess workflow is deprecated as of cryoDRGN version 3.x, which instead implements a new data shuffler for fast on-the-fly data loading. The cryodrgn preprocess workflow remains available in versions <=2.3.0.
By default, cryoDRGN loads the entire dataset into memory for fast data access during training. However, large cryo-EM datasets can easily exceed the amount of memory available on standard workstations. For datasets that do not fit into memory, cryodrgn train_vae can be run with the additional --lazy flag, which loads images on the fly instead of all at once at the beginning of training. This can, however, be very slow due to the filesystem access pattern of on-the-fly image loading, especially if the data is not located on an SSD.
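For example, a minimal sketch of a training command with lazy loading (the input filenames and hyperparameters below are placeholders; check cryodrgn train_vae -h for the full set of options in your version):

```bash
# Train with on-the-fly image loading instead of loading the full stack up front.
# particles.mrcs, ctf.pkl, and pose.pkl stand in for your own inputs.
cryodrgn train_vae particles.mrcs \
    --ctf ctf.pkl \
    --poses pose.pkl \
    --zdim 8 \
    -n 25 \
    --lazy \
    -o tutorial_lazy/
```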
To reduce the memory requirement of loading the whole dataset, as of version 0.3.3 cryoDRGN includes a tool, cryodrgn preprocess, that performs some of the image preprocessing normally done at the beginning of training. Separating out image preprocessing significantly reduces the memory requirement of cryodrgn train_vae, potentially leading to major training speedups ⚡ ⚡ ⚡ .
The new workflow replaces cryodrgn downsample with cryodrgn preprocess in the standard cryoDRGN workflow. Note that a new preprocessed particle stack will be created:
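A minimal sketch of the updated workflow (the filenames and -D value are placeholders; see cryodrgn preprocess -h for the exact options in your version):

```bash
# Old workflow: downsample, then train
# cryodrgn downsample particles.mrcs -D 128 -o particles.128.mrcs

# New workflow: preprocess the images, writing a new preprocessed particle stack
cryodrgn preprocess particles.mrcs -D 128 -o particles.128.ft.mrcs

# Then point training at the preprocessed stack with --preprocessed
cryodrgn train_vae particles.128.ft.mrcs \
    --preprocessed \
    --ctf ctf.pkl \
    --poses pose.pkl \
    --zdim 8 -n 25 \
    -o tutorial_preprocessed/
```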
Numbers
Some numbers for training on a dataset of 1,375,854 128x128 particles (86 GB):
Baseline:

- 607 GB maximum memory requirement
- 18.5 min to load the dataset in cryodrgn train_vae

With the new --preprocessed flag:

- 200 GB maximum memory requirement
- 3.2 min to load the dataset in cryodrgn train_vae
On a single Nvidia V100 GPU, this dataset trained in approximately 2 h 3 min per epoch (large 1024x3 model) when fully loaded into memory. Training with on-the-fly data loading (--lazy) was 4x slower, though this can vary widely depending on your filesystem/network. Recent tests on cached filesystems show no large penalty for --lazy image loading.
Technicalities
- Using cryodrgn preprocess in place of cryodrgn downsample means that images are windowed (a circular mask is applied in real space) before they are downsampled. This differs slightly from the original workflow, where images are first downsampled and the mask is applied during training. To exactly replicate the previous behavior, run cryodrgn downsample as usual, then run cryodrgn preprocess on the downsampled dataset.
- Viewing particle images in cryoDRGN_viz.ipynb and cryoDRGN_filtering.ipynb will now show Fourier-space images, since preprocessed particles are stored as Fourier transforms. A current workaround is to overwrite the path with the original particles when loading particle images in the Jupyter notebook, as sketched below:
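A minimal sketch of that workaround, assuming the cryoDRGN ~0.3.x notebook API (the filename is a placeholder, and your notebook's particle-loading cell may use a slightly different call):

```python
# In the particle-loading cell of cryoDRGN_viz.ipynb / cryoDRGN_filtering.ipynb,
# point at the ORIGINAL (real-space) particle stack instead of the preprocessed
# (Fourier-space) one, so that plotted images are interpretable.
from cryodrgn import mrc

# 'particles.128.mrcs' is a placeholder for your original, non-preprocessed stack
particles, header = mrc.parse_mrc('particles.128.mrcs', lazy=True)
```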
Still too large
If your dataset is still too large to load into memory, we recommend training on a subset of the images such that each subset fits into memory (e.g. split your dataset into two halves and run independent training jobs on each half). A random subset of your dataset can be generated with the utility cryodrgn_utils select_random:
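A minimal sketch, assuming the select_random interface from recent cryoDRGN versions (the counts and filenames are placeholders; check cryodrgn_utils select_random -h for your version):

```bash
# Randomly select 500k of the 1,375,854 particle indices and save them to a .pkl file
cryodrgn_utils select_random 1375854 -n 500000 -o ind500k.pkl

# Train on the selected subset by passing the index file with --ind
cryodrgn train_vae particles.mrcs \
    --ind ind500k.pkl \
    --ctf ctf.pkl --poses pose.pkl \
    --zdim 8 -n 25 \
    -o tutorial_subset/
```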
We are currently implementing a large refactor of lazy data loading. Additional updates and information on chunked data loading are tracked at https://github.com/zhonge/cryodrgn/issues/17. Beta code is available in the vb/imagesource pull request: https://github.com/zhonge/cryodrgn/pull/221.