The three-dimensional structure of proteins plays a crucial role in determining their function. Protein structure prediction methods, like AlphaFold, offer rapid access to a protein’s structure. However, large protein complexes cannot be reliably predicted, and proteins are dynamic, making it important to resolve their full conformational distribution. Single-particle cryo-electron microscopy (cryo-EM) is a powerful tool for determining the structures of large protein complexes. Importantly, the numerous images of a given protein contain underutilized information about conformational heterogeneity. These images are very noisy projections of the protein, and traditional methods for cryo-EM reconstruction are limited to recovering only one or a few consensus conformations. In this paper, we introduce cryoSPHERE, a deep learning method that takes a nominal protein structure (e.g., from AlphaFold) as input, learns how to divide it into segments, and moves these segments as approximately rigid bodies to fit the different conformations present in the cryo-EM dataset. This approach provides enough constraints to enable meaningful reconstructions of single protein structural ensembles. We demonstrate this with two synthetic datasets featuring varying levels of noise, as well as one real dataset. We show that cryoSPHERE is very resilient to the high levels of noise typically encountered in experiments, and that it consistently improves over the current state of the art for heterogeneous reconstruction.
Single-particle cryo-electron microscopy (cryo-EM) is a powerful technique for determining the three-dimensional structure of biological macromolecules, including proteins. In a cryo-EM experiment, millions of copies of the same protein are first frozen in a thin layer of vitreous ice and then imaged using an electron microscope. This yields a micrograph: a noisy image containing 2D projections of individual proteins. The protein projections are then located on this micrograph and cut out, so that an experiment typically yields between ten thousand and ten million images of individual proteins, referred to as particles. Our goal is to reconstruct the possible structures (called conformations) of the protein given these images. Frequently, proteins are conformationally heterogeneous, and each copy adopts a different structure. Conventionally, this information has been discarded, and all of the sampled structures have been assumed to be in only one or a few conformations (homogeneous reconstruction). Here, we would like to recover all of the structures in a heterogeneous reconstruction. There are a number of challenges:
Traditional methods reconstruct a volume of the protein, typically in Fourier space. Framing the problem this way has two shortcomings that make these methods susceptible to the low signal-to-noise ratio (SNR) of the images:
cryoSPHERE takes the number of segments N into which the protein is divided as a hyperparameter. During a forward pass, two steps happen concurrently:
cryoSPHERE outputs, for each image:
One can perform a principal component analysis (PCA) of the latent space recovered by cryoSPHERE and generate the structures corresponding to a traversal of each principal component. This provides a direct interpretation of the motion recovered by cryoSPHERE.
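As an illustration of the latent-space traversal described above, the following is a minimal sketch in Python with NumPy. It is not the cryoSPHERE implementation: the function name `pca_traversals`, its parameters, the choice of the 5th to 95th percentile range, and the synthetic latent vectors are all assumptions made for this example. The sketch only computes the latent points along each principal component; turning each point into a structure would require passing it through the trained decoder, which is not shown here.

```python
import numpy as np


def pca_traversals(latents, n_components=2, n_steps=10, percentiles=(5.0, 95.0)):
    """Return latent points along each principal component.

    Hypothetical helper for illustration only.
    latents: array of shape (n_images, latent_dim), one latent vector per image.
    Output:  array of shape (n_components, n_steps, latent_dim).
    """
    latents = np.asarray(latents, dtype=float)
    mean = latents.mean(axis=0)
    centered = latents - mean

    # SVD of the centered data: the rows of vt are the unit-norm principal
    # directions, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]

    # Coordinates of every image along each retained principal direction.
    scores = centered @ components.T  # shape (n_images, n_components)

    traversals = []
    for k in range(n_components):
        # Walk between two percentiles of the observed scores so that the
        # traversal stays inside the region populated by the data.
        low, high = np.percentile(scores[:, k], percentiles)
        steps = np.linspace(low, high, n_steps)
        traversals.append(mean + steps[:, None] * components[k])
    return np.stack(traversals)


# Synthetic stand-in for latent vectors (assumption: 1000 images, 8 dimensions,
# with most of the variance along the first coordinate).
rng = np.random.default_rng(0)
fake_latents = rng.normal(size=(1000, 8)) * np.array([5, 2, 1, 1, 1, 1, 1, 1])

paths = pca_traversals(fake_latents, n_components=2, n_steps=10)
print(paths.shape)  # (2, 10, 8)
```

Each row `paths[k, i]` is a latent vector; decoding the rows of `paths[k]` in order would give a sequence of structures that can be viewed as a trajectory along the k-th principal component. Restricting the walk to a percentile range is one possible design choice to avoid decoding latent points far from any observed image.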