Singular Value Decomposition (CHEOPS))
The strong dependence of isotropic chemical shifts on the local backbone geometry has motivated the development of methods to determine protein torsion (or dihedral) angles from a basic set of shifts (reviewed in [ 2 ]). The backbone and sidechain dihedral angles define the local conformation of a polypeptide chain, thus governing secondary structure, sidechain packing, and overall tertiary/quaternary folds. As a result, determination of peptide backbone (ɸ, ψ and ω) and sidechain (χ i ) dihedral angles directly from chemical shifts provides valuable restraints during structure calculation and refinement [ 25 – 27 ], especially in the absence of NOE restraints. While torsion angles can be directly measured through scalar J-couplings [ 28 – 30 ] or dipole-dipole and dipole-chemical shift anisotropy (CSA) cross-correlated relaxation [ 31 , 32 ], these methods are less applicable to larger proteins, where significant line-broadening and reduced signal-to-noise limit the use of these experiments [ 33 ].
Whereas DFT (Density Functional Theory)-based calculations can provide valuable insights into the dependence of chemical shift values on local geometry for different nuclei (such as 1 H α , 15 N, 13 C α , 13 C β , 13 Cʹ), empirical approaches have generally been more successful in directly modeling the local backbone structure [ 4 ]. Here, torsion angles are predicted based on amino acid sequence and chemical shift similarity relative to a curated database of assigned chemical shifts/protein structure pairs derived from the PDB and Biological Magnetic Resonance Bank (BMRB) [ 34 ]. Several methods stemming from this approach have already been reviewed in [ 4 , 5 , 7 ]. As expected, the accuracy of these chemical shift-based methods is higher (>73% of the torsion angle predictions lie within 30° from corresponding angles in the reference structures) compared to approaches that exclusively use sequence information (>65% of the predictions lie within 36° from the reference angles) [ 35 ]. Larger databases, together with more consistent referencing of chemical shift values are expected to improve the accuracy and precision of torsion angle predictions even further [ 36 , 37 ]. In addition to torsion angles, alternative methods use chemical shift data to elucidate secondary structures [ 4 , 5 , 7 ] and less frequently occurring Xaa-Pro peptide bond conformations [ 38 ]. While all these methods provide local restraints, which can be applied directly in structure calculations, the majority of NMR structure determination programs are supplemented with additional sources of NMR data in order to obtain highly converged models.
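The accuracy figures quoted above are fractions of predicted angles that fall within a fixed tolerance of the reference angles. The following is a minimal sketch of such a metric, assuming angles in degrees and accounting for the 360° periodicity; the function names are hypothetical and are not taken from any of the cited programs.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def fraction_within(predicted, reference, tolerance_deg=30.0):
    """Fraction of predicted torsion angles within tolerance_deg of the reference."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty angle lists of equal length")
    hits = sum(
        1
        for p, r in zip(predicted, reference)
        if angular_difference(p, r) <= tolerance_deg
    )
    return hits / len(predicted)


# Hypothetical phi angles for three residues (illustrative values only).
predicted_phi = [-60.0, 170.0, 100.0]
reference_phi = [-65.0, -170.0, 20.0]
score = fraction_within(predicted_phi, reference_phi)
```

Treating 170° and -170° as 20° apart, rather than 340°, is the detail that a naive subtraction would miss.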
Alongside the efforts to derive structural restraints from chemical shifts, prediction of the chemical shifts of known structures is an active field of research where a variety of sequence-based, structure-based and hybrid approaches have been developed [ 4 , 5 ]. Accurate chemical shift predictions from an available X-ray structure can actively aid in making chemical shift assignments, as well as in structure modeling and validation [ 5 , 39 ]. Analogous to dihedral angle prediction methods discussed in Section 2, sequence-based chemical shift prediction methods are based on the concept that sequence similarity often results in local structure and chemical shift homology. This idea forms the basis of SHIFTY [ 40 ], an early method that is able to predict 1 H and 13 C backbone chemical shifts with a Pearson’s correlation coefficient (between experimental and predicted) >0.85 for all atoms (proton and carbon) when the query protein shares >35% sequence identity to a reference structure with known chemical shift assignments available in the BMRB. Following SHIFTY, there have been a number of methods (extensively reviewed in [ 4 , 5 , 7 ]), which can predict chemical shifts of up to 40 atom types in less than a few seconds per residue. The correlation coefficients of a few of these methods (SHIFTX [ 41 ], SPARTA (Shifts Predicted from Analogy in Residue type and Torsion Angle) [ 42 ], SPARTA+ [ 43 ], CamShift [ 44 ], SHIFTS [ 45 ], PROSHIFT [ 46 ] and SHIFTX2 [ 47 ]) range from 0.7 to 0.99 for 15 N, 13 C α , 13 C β , 13 Cʹ, 1 H N , and 1 H α backbone atoms; arguably, an exception is 1 H N , where SHIFTX2 (correlation coefficient of 0.97) outperforms other methods by a large margin (correlation coefficients of other methods lie between 0.51 and 0.71), when tested on a benchmark set of 61 proteins [ 47 ]. SHIFTX/SHIFTX2 and SPARTA/SPARTA+ are comparable in performance and are widely used owing to their speed and accuracy.
The practical utility of some of these methods in de novo structure determination is discussed throughout this manuscript.
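The correlation coefficients quoted above compare experimental and predicted shifts for one atom type at a time. As a minimal sketch, Pearson's coefficient can be computed as follows; the function name and the example values are illustrative assumptions, not output from any of the cited predictors.

```python
import math


def pearson_r(experimental, predicted):
    """Pearson correlation coefficient between two equally long shift lists."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_e == 0.0 or var_p == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_e * var_p)


# Hypothetical 13C-alpha shifts in ppm (illustrative values only).
exp_ca = [58.1, 55.3, 62.0, 52.4, 56.7]
pred_ca = [57.6, 55.9, 61.2, 53.0, 56.1]
r_ca = pearson_r(exp_ca, pred_ca)
```

Because the coefficient is computed per atom type, a method can score well on 13 C α while performing poorly on 1 H N , which is the pattern described in the text.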
Structure prediction methods have shown great success for small to medium sized proteins (<150 residues) using various strategies, including ab initio [ 48 , 49 ], comparative modeling [ 50 ], fold prediction and threading [ 51 ]. However, de novo modeling of larger proteins remains a challenging problem owing to the number of feasible solutions to the conformational search problem [ 52 ]. In spite of the computational complexities involved in ab initio structure determination, there has been significant progress in the development of sophisticated methods in the past two decades. Rosetta [ 53 ], QUARK [ 54 ] and I-TASSER (Iterative Threading Assembly Refinement) [ 55 ] are a few software packages that have been widely applied to construct 3D structural models starting only from a query amino acid sequence. These methods are particularly useful when no homologs with known structures can be identified in the PDB, which is often the case for larger proteins [ 48 ].
Bowie and Eisenberg pioneered the field of ab initio prediction with their concept of generating protein models from an assembly of short, overlapping backbone fragments derived from a structural database [ 56 ], and this idea laid the foundation for several early implementations of ab initio methods [ 57 , 58 ]. In these methods, the selection of fragments from a high resolution protein structure database is based on sequence or secondary structure homology. Following selection, fragments are assembled using Monte Carlo-simulated annealing methods that minimize physically realistic energy functions to produce 3D structural models. Although these methods can produce low-energy models exhibiting the native fold for small proteins (<100 residues), larger targets pose significant challenges due to the quality of fragments used for assembly and the exponential increase in the conformational search space. To overcome the drawbacks of these early ab initio methods, several protocols that exploit NMR chemical shifts have emerged (reviewed in [ 7 ]). The great majority of these methods employ the generalized fragment assembly framework ( Fig. 2 ). Here, sequence and chemical shifts are used to derive local structural features, such as torsion angle restraints and secondary structure information, which further guide the fragment selection from a database of high resolution X-ray structures. The selected fragments are then used to build low-resolution models starting from a fully extended protein chain, characterized by bond lengths, bond angles, and backbone torsion angles. Bond lengths and angles are typically fixed to ideal values and the peptide bond is assumed to be planar; it is therefore the backbone torsion angles (ɸ and ψ) that effectively define the conformation of a protein chain [ 59 , 60 ].
This reduction in the degrees of freedom from Cartesian to torsion angle space greatly boosts the performance of a search towards the native conformation using Monte Carlo-based optimization methods. Lastly, sidechain rotamers [ 61 ] and minor deviations from ideal values are introduced on low-resolution conformations, which undergo further refinement to reduce steric clashes, and finally to produce all-atom structural models.
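The fragment assembly search described above can be illustrated with a toy Monte Carlo simulated annealing loop in torsion angle space. This is a sketch under strong assumptions: the chain is a plain list of (phi, psi) pairs, the energy function is supplied by the caller, the cooling schedule is geometric, and all names are hypothetical. It does not reproduce the move sets or energy functions of CHESHIRE or Rosetta.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis criterion: accept downhill moves, uphill ones with Boltzmann odds."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def insert_fragment(torsions, fragment, start):
    """Return a copy of the (phi, psi) list with the fragment pasted in at start."""
    new = list(torsions)
    new[start:start + len(fragment)] = fragment
    return new


def assemble(n_residues, fragments, energy, steps=2000,
             t_start=2.0, t_end=0.1, seed=0):
    """Toy fragment assembly starting from a fully extended chain."""
    rng = random.Random(seed)
    current = [(-180.0, 180.0)] * n_residues  # extended starting conformation
    e_current = energy(current)
    best, e_best = current, e_current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        fragment = rng.choice(fragments)
        start = rng.randrange(n_residues - len(fragment) + 1)
        trial = insert_fragment(current, fragment, start)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, temperature, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(torsions):
    """Illustrative energy: squared deviation from helical (phi, psi) values."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in torsions) / 1e4


library = [[(-60.0, -45.0)] * 3, [(-120.0, 130.0)] * 3]
model, model_energy = assemble(6, library, toy_energy)
```

Each trial move replaces the torsion angles of a short window, so the search only ever visits conformations that are built from observed local geometries. This is the practical benefit of working in torsion angle space with a fragment library.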
Fig. 2. General pipeline for de novo structure determination using fragment assembly. Backbone fragments are first generated from high resolution structures obtained from a curated database derived from the PDB. Fragments are then ranked according to primary amino acid sequence information and/or chemical shift-based torsion angle predictions. The assembly of selected fragments generates low-resolution models, which are iteratively refined using a physically relevant energy function to yield the final structures.
An early fragment assembly method (Molecular Fragment Replacement or MFR) utilizes experimental chemical shifts and dipolar couplings to model low-resolution structures [ 62 ]. Akin to “Molecular Replacement” methods, widely used in X-ray crystallography refinement, this approach is inspired by previous work that determined local structural fragments using sparse NOE data [ 63 ]. Specifically, MFR performs a pairwise search of a fragment database where the best candidates are selected by a χ²-test that evaluates the difference between (i) measured dipolar couplings and those calculated through a singular value decomposition procedure (dipolar homology) and (ii) experimental and predicted chemical shift values for each selected fragment. The well-fitting fragments provide backbone torsion angle restraints that are applied during low-resolution structure modeling. Finally, the predicted models are further refined in order to improve their agreement with experimental chemical shifts and dipolar couplings. The utility of MFR is highlighted by a measured backbone RMSD (Root Mean Square Deviation) of 1.2 Å (angstrom) between modelled and X-ray structures of ubiquitin [ 62 ], suggesting that folds for small proteins can be captured using solely the chemical shifts and dipolar couplings, thereby alleviating the need to acquire and analyze NOESY data. Further improvements have been made to this algorithm at various stages including fragment search, assembly, sidechain placement, and structure refinement by employing other NMR parameters, such as J-couplings and NOEs [ 64 , 65 ]. While the early MFR method could accurately model backbone structures of small proteins, a significant limitation remained with respect to sidechain placement [ 62 ], which has been addressed in more recent methods [ 64 , 65 ].
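The χ²-based fragment selection described above can be sketched as a simple ranking of candidates by their combined misfit to the data. The sketch assumes that dipolar couplings and chemical shifts have already been back-calculated for each candidate (the singular value decomposition fit itself is not shown), and that a single uncertainty applies to each data type. The dictionary keys and function names are hypothetical.

```python
def chi_square(observed, calculated, sigma):
    """Sum of squared, uncertainty-scaled residuals between two value lists."""
    if len(observed) != len(calculated):
        raise ValueError("observed and calculated lists differ in length")
    return sum(((o - c) / sigma) ** 2 for o, c in zip(observed, calculated))


def rank_fragments(candidates, exp_rdc, exp_shifts, sigma_rdc=1.0, sigma_shift=1.0):
    """Return (score, name) pairs sorted from best to worst combined chi-square.

    Each candidate is a dict with keys 'name', 'rdc' and 'shifts', where the
    last two hold values back-calculated for that fragment.
    """
    scored = []
    for frag in candidates:
        score = (chi_square(exp_rdc, frag["rdc"], sigma_rdc)
                 + chi_square(exp_shifts, frag["shifts"], sigma_shift))
        scored.append((score, frag["name"]))
    return sorted(scored)


# Hypothetical data: couplings in Hz, shifts in ppm (illustrative values only).
experimental_rdc = [5.0, -3.0]
experimental_shifts = [58.0, 30.0]
fragment_pool = [
    {"name": "poor_fit", "rdc": [0.0, 0.0], "shifts": [55.0, 33.0]},
    {"name": "good_fit", "rdc": [5.5, -2.5], "shifts": [58.0, 30.5]},
]
ranking = rank_fragments(fragment_pool, experimental_rdc, experimental_shifts)
```

Scaling each residual by its uncertainty is what lets couplings in Hz and shifts in ppm be added into one score.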
One of the methods that surpassed previously existing approaches in structure prediction accuracy, addressed known limitations and worked well for a wide range of molecular weights is CHESHIRE (Chemical Shift Restraints) [ 59 ]. The CHESHIRE procedure further extends the fragment-based strategy introduced by MFR [ 62 ] together with NMR chemical shifts to model protein structures. This procedure consists of three phases that follow a generalized fragment assembly framework ( Fig. 2 ). First, the 3PRED algorithm [ 59 ] is used to predict secondary structures of three- and nine-residue fragments using NMR chemical shifts in conjunction with sequence-based secondary structure propensities. Specifically, the experimental chemical shifts are used to estimate the probability of an amino acid adopting a given secondary structure type. Additionally, secondary structure propensities are computed from known structures in the ASTRAL Structural Classification of Proteins (SCOP) database [ 66 ] according to a classification performed by the STRIDE (Structural Identification) algorithm [ 67 ]. Second, the TOPOS algorithm [ 59 ] similar to TALOS (Torsion Angle Likelihood Obtained from Shift and sequence similarity) [ 68 ] is used to predict backbone torsion angles using combined information drawn from experimental chemical shifts and previously determined secondary structural elements. The primary difference between TOPOS and TALOS lies in how they use chemical shifts for scoring (for instance, TOPOS ignores 1 H N chemical shifts). Following torsion angle prediction, candidate fragment conformations are selected from a structural database and filtered according to an energy function consisting of empirical terms sensitive to torsion angles, secondary structure and agreement between experimental and back-calculated chemical shifts (computed via SHIFTX). 
Third, fragments are assembled to generate low-resolution models using a Monte Carlo-simulated annealing method, where the query protein chain adopts a simplified representation consisting of only the backbone atoms and C β atoms of sidechain groups. Finally, sidechain rotamers from the Dunbrack library [ 61 ] are added to the low-resolution models following optimization of an all-atom energy function using Monte Carlo-based techniques. The all-atom energy function ( E ) contains a chemical shift term (back-calculated from the predicted models using SHIFTX) and a molecular dynamics (MD) force field according to Eq. (1) :
Here, the numerator recapitulates an MD-derived force field containing terms from van der Waals, electrostatic, solvent, pair-wise mean force and hydrogen bonding effects [ 59 ]. The denominator is an experimental scoring function, where $C_X$ measures the correlation between experimental chemical shifts and those back-calculated using SHIFTX for $X \in \{H^{\alpha}, N, C^{\alpha}, C^{\beta}\}$ atoms. The corresponding $k_X$ values are user-defined constants.
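A form consistent with this description is sketched here. The five numerator terms follow the list given in the text; how the weights and correlations are combined in the denominator is not recoverable from the text, so the weighted sum is an assumption, as are the subscript labels.

```latex
E = \frac{E_{\mathrm{vdW}} + E_{\mathrm{elec}} + E_{\mathrm{solv}} + E_{\mathrm{PMF}} + E_{\mathrm{HB}}}
         {\sum_{X \in \{H^{\alpha},\, N,\, C^{\alpha},\, C^{\beta}\}} k_{X}\, C_{X}}
```

Under this reading, a model whose back-calculated shifts correlate better with experiment has a larger denominator, so a favorable (negative) force field energy is amplified.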
In the original implementation of CHESHIRE’s refinement procedure, the optimization of a combined MD/chemical shift scoring function is sufficient to bias the calculations towards the native state [ 59 ]. In particular, the derivatives of the chemical shift-based terms are not explicitly computed, which would be required for any approaches employing an MD-based integration of Newton’s equations of motion. Apart from exhibiting high accuracy in predicting benchmarked proteins of sizes up to ~14 kDa (Kilodalton) (backbone atom RMSD <1.8 Å) [ 59 ], this approach performed very well (RMSD values <2.6 Å for proteins up to 160 aa (amino acids) in size) in the third round of CASD-NMR (Community Wide Assessment of NMR Structure Determination) evaluation [ 69 , 70 ].
While the CHESHIRE procedure clearly demonstrated that chemical shift-derived fragments can be used to build near-native structures, the same was independently highlighted by the Chemical Shift (CS)-Rosetta approach ( Fig. 3A ) [ 60 ]. CS-Rosetta combines a highly optimized ab initio fragment assembly protocol [ 53 ], employing different sampling schemes and multiple low-resolution energy functions, together with NMR chemical shifts to yield accurate structural models [ 60 ]. This protocol leverages high resolution structures from the PISCES (Public server for culling sets of protein sequences from the PDB by sequence identity) database [ 71 ], corresponding secondary structures assigned by DSSP (Dictionary of Secondary Structure of Proteins) [ 72 ] and predicted chemical shifts of 13 C α , 13 C β , 13 Cʹ, 15 N, 1 H α and 1 H N nuclei from SPARTA, to generate a library of native-like fragments, as opposed to fragments obtained from sequence information alone ( Fig. 3A ). In the earlier implementation of this protocol, the fragment selection was carried out using the MFR approach [ 62 ], which was later superseded by a modular algorithm that incorporates various experimental data terms and/or other prior biases during the fragment selection process [ 73 ]. The fragments selected using chemical shifts possess ɸ, ψ backbone torsion angles that are closer to their native values as shown for two representative arginine residues (at positions 45 and 61) in a 72 aa protein ( Fig. 3B and C ). Following three- and nine-residue fragment selection, assembly and refinement are carried out using Rosetta’s Metropolis Monte Carlo procedure. CS-Rosetta makes use of Rosetta’s simplicity and speed during the fragment assembly process. As mentioned previously, a protein chain in Rosetta is represented using torsion angle coordinates.
By convention, if any torsion angle within a protein chain is perturbed, the angular motion affects all the atoms towards the C-terminus (called the lever-arm effect). To eliminate this effect, a protein chain is depicted using a directed (from N- to C-terminus), acyclic graph called a fold tree [ 53 , 74 ]. In a fold tree, the nodes represent residues and the edges represent covalent connections. During the angular motion of torsions, breaks are introduced in a protein chain to eliminate the lever-arm effect. The fold information is preserved using long-range connections, which are also added as edges in the tree [ 53 , 74 ]. The use of the fold tree framework greatly simplifies the assembly process by allowing the sampling of non-local structural features while remaining in torsion angle space [ 74 ].
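The lever-arm effect can be seen in a deliberately simplified two-dimensional chain, where each residue contributes one turn angle and one bond of unit length. This toy model is an assumption made for illustration and is not how Rosetta builds coordinates; the function names are hypothetical.

```python
import math


def chain_coordinates(turn_angles_deg, bond_length=1.0):
    """Build a planar chain: turn by each angle, then step one bond forward."""
    x, y, heading = 0.0, 0.0, 0.0
    coords = [(x, y)]
    for turn in turn_angles_deg:
        heading += math.radians(turn)
        x += bond_length * math.cos(heading)
        y += bond_length * math.sin(heading)
        coords.append((x, y))
    return coords


def displaced_atoms(before, after, tol=1e-9):
    """Indices of atoms whose position differs between two conformations."""
    return [
        i
        for i, (p, q) in enumerate(zip(before, after))
        if math.hypot(p[0] - q[0], p[1] - q[1]) > tol
    ]


straight = chain_coordinates([0.0, 0.0, 0.0, 0.0, 0.0])
kinked = chain_coordinates([0.0, 0.0, 30.0, 0.0, 0.0])
moved = displaced_atoms(straight, kinked)
```

Changing the single angle at position 2 moves every atom downstream of it and none upstream. A chain break placed downstream would stop the change from propagating further, which is the role of the cuts in a fold tree.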
Fig. 3. Chemical shifts aid the selection of high quality backbone fragments in CS-Rosetta. ( A ) C α RMSD to native structure among the top twenty, three-residue backbone fragments for each residue position in the sequence of a 72 aa query protein. Fragments are selected based on sequence profile and secondary structure prediction in Rosetta (red). Alternatively, chemical shifts can be used to bias the fragment selection process in CS-Rosetta (blue). ( B and C ) Distribution of ϕ,ψ backbone dihedral angles in the top 100 fragments derived using Rosetta (red) or CS-Rosetta (blue) for two representative Arg residues at positions ( B ) 45 and ( C ) 61. Green dots indicate the ϕ,ψ dihedral angles observed in the native structure (X-ray).
The low-resolution phase employs a simplified representation of a protein chain, where only backbone heavy atoms and a centroid site representing the sidechain are present for every residue. Thereafter, Monte Carlo fragment trial steps are applied, starting from a fully extended protein chain, to yield collapsed backbone folds. In this phase, the low-resolution Rosetta energy function [ 75 – 77 ] is continuously minimized while sampling fragments from the library. In the high resolution phase, sidechain atoms are added to low-resolution structures. The placement of sidechain atoms is challenging due to the exponential number of possible rotamer combinations for all amino acids in a query protein sequence. To solve this problem, Rosetta uses a module called the packer [ 53 ]. The packer selects feasible rotamers for each amino acid from the Dunbrack library [ 61 ], and then uses a Monte Carlo-simulated annealing method to search for the optimal rotamer combinations. The high resolution (or full-atom) models are further refined using a Monte Carlo- and gradient-based optimization process that performs small backbone perturbations to resolve steric clashes. The final full-atom models are retained based on a high resolution scoring function [ 75 – 77 ] and the quality of fit of the predicted models to the experimental chemical shifts. The compliance of the predicted models to experimental shifts is assessed by back-calculating the chemical shifts from the models using SPARTA. The chemical shift deviation in the models is further used to adjust the Rosetta all-atom energy, E, according to Eq. (2) :
Here, i and j are the nuclei and residues, respectively, $\delta_{i,j}^{\mathrm{pred}}$ is the back-calculated chemical shift value obtained from SPARTA, $\delta_{i,j}^{\mathrm{exp}}$ is the experimental chemical shift value, $\sigma_{i,j}$ is a standard deviation and c is a weight factor, which can be optimized according to benchmark calculations.
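These definitions correspond to a χ²-style penalty added to the Rosetta all-atom energy. A form that matches the symbols defined in the text is sketched here; the name E′ for the adjusted energy is an assumption introduced for illustration.

```latex
E' = E + c \sum_{i} \sum_{j}
     \left( \frac{\delta_{i,j}^{\mathrm{pred}} - \delta_{i,j}^{\mathrm{exp}}}{\sigma_{i,j}} \right)^{2}
```

Dividing each deviation by its standard deviation places nuclei with very different shift ranges, such as 15 N and 1 H α , on a common scale before they are summed.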
The reliability and accuracy of CS-Rosetta were demonstrated through the comparison of predicted models with structures determined experimentally by the Northeast Structural Genomics (NESG) consortium [ 60 ]. The lowest-energy predicted models were remarkably close to the solved X-ray or NMR structures (0.6–2.1 Å backbone RMSD). In this review, we illustrate the effects of chemical shift data in guiding structure modeling within CS-Rosetta for a 15.5 kDa target protein, RTT103 (Regulator of Ty1 Transposition) ( Fig. 4 ) [ 78 , 79 ]. Here, models calculated using chemical shifts are closer to the native structure (PDB ID 2KM4) [ 80 ] and are well-converged, as opposed to the models calculated from sequence and secondary structure prediction alone ( Fig. 4A–C ). The distributions of Rosetta energies in the sampled models further highlight that, through the selection of more native-like fragments, chemical shifts greatly limit the conformational space and bias the search towards a native structure ( Fig. 4D ). When CS-Rosetta was originally tested, it could accurately model structures of proteins of sizes up to 15 kDa. However, this size remained well below the solution NMR standard of 25–30 kDa at the time ( Fig. 1 ). The size limitation of CS-Rosetta was subsequently addressed by an improvement in the ab initio fragment sampling protocol and the use of backbone RDC restraints along with chemical shifts (CS-RDC-Rosetta, a β-version of RASREC-Rosetta, discussed below) [ 81 ]. This updated protocol allowed modeling of proteins up to 25 kDa in size.
Fig. 4. Improved de novo structure modeling using chemical shifts. Structure calculations for a 15.5 kDa protein (RTT103) are shown with and without the use of chemical shifts. ( A ) Ten lowest-energy structures generated by CS-Rosetta using manually assigned backbone chemical shifts. ( B ) Ten lowest-energy structures generated by CS-Rosetta without the use of chemical shifts. ( C ) Convergence vs backbone RMSD (Å) to native (PDB ID 2KM4), of the ten lowest-energy structures calculated with (green) and without (crimson) the use of chemical shifts. The native structures (PDB ID 2KM4) [ 80 ], which were idealized and refined in the Rosetta force-field for comparison, are additionally provided (purple). ( D ) Rosetta Energy (in R.E.U.) distributions among the 100 lowest-energy structures generated by CS-Rosetta with (green) and without (crimson) the use of chemical shift fragments in ab initio calculations. R.E.U. – Rosetta Energy Units. Data obtained from [ 216 ].
To further address the conformational sampling problem in order to increase the size limit (>25 kDa) of de novo structure determination, the Resolution Adapted Structural Recombination (RASREC)-Rosetta protocol was developed [ 82 ]. RASREC-Rosetta is designed to address difficult targets containing complex topologies with many sequentially distant (or nonlocal) interactions, and makes use of optimization strategies implemented by other protocols [ 74 , 83 – 86 ] alongside important features, including the fold tree framework from Rosetta 3.0 [ 53 ]. Each optimization strategy is customized and embedded in different phases of this multi-staged, iterative approach. Specifically, RASREC-Rosetta has an exploration stage followed by five resampling stages. Every stage retains the best scoring candidate structures that serve as a knowledge base, in conjunction with all available experimental data, for subsequent stages. In stages 1–3, the protein chain represented as a fold tree explores different backbone fragments and long-range β-strand pairings. The possible β-strand arrangements are made available through an annotated library constructed from all β-sheets within high resolution X-ray structures in the PDB. During stages 3–6, the sets of three- and nine-residue fragments being sampled are enriched by segments from the low-resolution structures generated in the earlier stages, to promote resampling of native-like features. Hence, consistently observed structural features that form the core of a protein are retained. These features aid in distinguishing incorrect folds from the correct one in the later stages. Most importantly, if any set of low-resolution candidates does not exhibit core features and seems unfit for full-atom refinement by stage 4, the protocol reverts to earlier stages and restarts from the fold trajectories of successful candidates.
After approaching the full-atom refinement stages (stages 5 and 6), fold tree chain breaks introduced during β-strand topology resampling are closed to yield realistic candidate structural models. In addition to the implementation of an array of optimization strategies, RASREC benefits from parallelization through MPI (Message Passing Interface), which allows for batches of structure modeling calculations to be distributed across all available cores in a computer cluster. This greatly enhances the sampling range and overall performance of the method.
RASREC-Rosetta exhibited improved performance relative to conventional CS-Rosetta on a benchmark set of 11 proteins in 15–25 kDa size range where very sparse amide NOE restraints (of the order of tens) in addition to backbone chemical shifts were provided. Since then, it has shown considerable accuracy when applied to larger proteins with complex folds by several research groups. In particular, RASREC-Rosetta was used together with sparse NOE data recorded for 11 protein targets of sizes up to 40 kDa [ 87 ]. Here, the collection of high quality structural restraints from fully protonated samples poses a challenge due to slower rotational diffusion rates of larger proteins (>20 kDa), which leads to low signal-to-noise ratios. The authors addressed this drawback by employing a selective methyl labeling scheme (ILV, Isoleucine Leucine Valine) in a perdeuterated background to record methyl-methyl, methyl-amide and amide-amide 1 H NOE contacts at increased sensitivity and resolution [ 87 – 93 ]. These sparse sets of NOE restraints, together with experimental chemical shifts of the backbone atoms, were used by RASREC-Rosetta to model structures at high levels of accuracy (median C α RMSD <2 Å). In another application, a structural model of the murine cytomegalovirus (CMV) immunoevasin m04, a 23 kDa protein, was determined by RASREC-Rosetta using chemical shifts together with several data sets of complementary NOE and RDC measurements [ 94 ]. By progressively increasing the number of structural restraints supplied to RASREC-Rosetta, well-converged structural models were obtained revealing a novel, complex β-sheet topology similar to the immunoglobulin (Ig) protein fold ( Fig. 5A–C ). Notably, the degree of structural convergence increased by 16% and 22% with an increase in the total number of local restraints provided by a modest set of long-range (i) amide-amide ( Fig. 5A and B ) and (ii) amide-amide, amide-methyl, methyl-methyl ( Fig. 5C ) NOEs recorded using non-uniform sampling methods (NUS) ( Fig. 5D ). In addition to achieving higher levels of convergence, the use of all available restraints (amide-amide, amide-methyl, methyl-methyl NOE contacts and two RDC datasets) allowed modeling of structures that were within 0.6 Å (backbone RMSD) from the X-ray structure (determined independently by a different group and published after the RASREC-Rosetta models), and showed the correct placement of all core sidechains ( Fig. 5E ).
Fig. 5. Convergence of a novel fold adopted by the murine cytomegalovirus m04 protein during iterative RASREC-Rosetta structure calculations with various sparse NMR data sets. Ten lowest-energy structural models calculated using NMR chemical shifts together with ( A ) amide-amide NOE restraints ( B ) amide-amide NOE and RDC restraints ( C ) amide-amide, amide-methyl and methyl-methyl NOE together with RDC restraints, show that 73%, 89% and 95% of residues are converged within 3 Å (backbone RMSD), respectively. ( D ) 2D 13 C– 13 C projection of a 4D methyl HMQC-NOESY-HMQC (Heteronuclear Multiple Quantum Coherence) spectrum recorded without (left) and with (right) non-uniform sampling (NUS). The NUS experiment was recorded with a sparsity of 1.56% and reconstructed using the SMILE algorithm in NMRPipe [ 246 , 247 ]. Both spectra were acquired with similar parameters and the same net acquisition time. ( E ) An overlay of the m04 protein structure determined by solution NMR (cyan, PDB ID 2MIZ) [ 94 ] and X-ray crystallography (green, PDB ID 4PN6) [ 248 ] with backbone heavy atom RMSD of 0.6 Å. This figure is partially (Panels A, B and C) adapted from [ 94 ] with permission.
To illustrate the advantages of fragment-based approaches over torsion angle dynamics methods (the most popular approaches for NMR-based structure determination among all entries in the PDB), we calculated new structures of Abl kinase RM (Regulatory Module) protein complex using RASREC-Rosetta and compared them with PDB deposited models, which were generated using CYANA (Combined Assignment and Dynamics Algorithm for NMR Applications) (PDB ID 6AMW) [ 95 , 96 ]. The CYANA structures were modeled using chemical shift derived torsion angle restraints together with 3830 short- and long-range NOEs and an additional set (consisting of 80 restraints) of ‘NOE-derived’ hydrogen bond restraints [ 96 ]. In contrast, for the RASREC-Rosetta calculations we used chemical shift-derived torsion angle restraints and a sparse set (1547 long-range) of NOEs (a subset of restraints provided to CYANA; BMRB ID 30332) to guide the structure determination ( Fig. 6 ). The ten lowest-energy models ( Fig. 6A ) obtained using RASREC-Rosetta showed improved convergence with respect to the average relative orientation of the two individual domains relative to the models produced using CYANA, illustrated by structure superpositions performed using either the SH3 domain (residues 83–138) ( Fig. 6B ) (SH: Src Homology) or the SH2 domain (residues 139–237, linker and SH2) ( Fig. 6C ). As highlighted by our results, the use of fragment-based approaches with advanced sampling strategies and a more elaborate high resolution energy function leads to improved convergence of models from a lower restraint density ( Fig. 6D–F ).
Fig. 6. Comparison of the Abl kinase regulatory module structural ensembles calculated using RASREC-Rosetta and CYANA. ( A ) Globally aligned ten lowest-energy models of Abl kinase regulatory module (residues 83–237) calculated by RASREC-Rosetta (left) and CYANA (PDB ID 6AMW) (right) using amide-amide, amide-methyl, methyl-methyl NOE and H-bond (for CYANA) restraints. ( B ) Ten lowest-energy models calculated using RASREC-Rosetta (left) and CYANA (right) superimposed with respect to domain A (residues 83–138, SH3 domain). ( C ) Ten lowest-energy models calculated using RASREC-Rosetta (left) and CYANA (right) superimposed with respect to domain B (residues 139–237, connector and SH2 domain). ( D ) Total number of amide-amide, amide-methyl and methyl-methyl NOE restraints used by RASREC-Rosetta and CYANA during iterative structure calculation and refinement. CYANA additionally uses a set of ‘NOE-derived’ hydrogen bonds (H-bond). Amide (orange): amide to amide and amide to methyl NOE contacts. Aliphatic (gray): methyl-methyl NOE contacts. H-bond restraints (red). ( E and F ) Average pairwise backbone heavy-atom RMSDs (in Å) using structural superimpositions performed with respect to different domain selections are shown per residue for structural ensembles calculated using CYANA and RASREC-Rosetta, respectively. Full alignment (blue): global alignment of ten lowest-energy structures. Domain A alignment (crimson): alignment of ten lowest-energy structures with respect to the SH3 domain. Domain B alignment: alignment of ten lowest-energy structures with respect to the connector and SH2 domain. Structural models of Abl kinase RM calculated using RASREC-Rosetta and the corresponding NMR data are available at https://dash.library.ucsc.edu/stash/dataset/doi:10.7291/D1Q94R .
Whereas MFR [ 62 ], CHESHIRE [ 59 ] and CS-Rosetta [ 60 , 82 ] utilize chemical shifts in conjunction with known structures to derive a selection of low-resolution backbone fragments, they do not take full advantage of the high resolution structural information encoded within the data. In these methods, the primary use of experimental chemical shifts is through a comparison against back-calculated (via SPARTA or SHIFTX) chemical shift values used to measure the compliance of selected fragments or models computed using Monte Carlo-based optimization methods. While Monte Carlo-based methods fare very well during structure refinement, they also have a very high rejection rate of trial moves (about 90%) during random exploration while modeling unknown structures. Furthermore, the chemical shift scoring terms computed using SPARTA or SHIFTX are non-differentiable, and therefore, the restraints derived from chemical shifts cannot be used directly to perform a uniform exploration of the conformational (ɸ and ψ) space. To address this bottleneck, several research groups have incorporated chemical shifts directly as differentiable distance restraints [ 44 , 97 , 98 ]. Notably, CamShift applies chemical shift restraints during MD simulations (Chemical Shift restrained Molecular Dynamics (CS-MD)) [ 44 , 97 , 98 ]. Towards this end, CamShift models NMR chemical shifts as polynomial functions of interatomic distances, deviations from random coil values, dihedral angles, ring current and hydrogen bonding effects. The distance-dependent term of the CamShift objective function is contributed by backbone, sidechain, and through-space atom pair correlations as highlighted by Eq. (3) :
Here, X ∈ {backbone, sidechain, through-space}, distance ij is the distance between atoms i and j, and α ij and β ij are parameters derived from known structures with assigned backbone chemical shifts in the database [ 99 ]. δ backbone captures the distances between a query atom and the backbone atoms of nearby residues, along with additional distances between backbone atom pairs of the neighbors. δ sidechain accounts for the distances between the query atom and the sidechain atoms of the same residue. Lastly, δ through-space covers the distances to all atoms within 5 Å of the query atom, excluding the backbone atoms of the query residue and of the neighboring residues that are already included in δ backbone .
In this approach, during every integration step of the MD simulation, an overall potential function is calculated by taking a difference between the CamShift-predicted chemical shifts and the experimental shifts. Here, CamShift computes the forces by directly evaluating the derivatives of the chemical shift potential with respect to the various interatomic distances in Eq. (3) , along the x, y and z coordinates [ 44 ]. Since MD simulations are carried out in the Cartesian coordinate system, the size of the system becomes a limiting factor for larger proteins. Therefore, combining such methods to perform refinement after the generation of an initial set of starting models computed quickly using existing fragment assembly approaches (such as CHESHIRE and CS-Rosetta) provides a promising avenue towards modeling larger, more complex protein folds [ 97 ].
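The force evaluation described above can be illustrated with a short sketch. It assumes that each atom pair contributes a power-law term α·d^β to the predicted shift, which is consistent with the parameter definitions given for Eq. (3) but is an assumption here, and it uses a simple harmonic penalty on the difference between predicted and experimental shifts. The function names, the force constant and all parameter values are hypothetical and are not taken from the CamShift implementation.

```python
import math

def distance_term(query, partners, params):
    """Distance-dependent shift contribution for one query atom.

    query    : (x, y, z) of the query atom
    partners : list of (x, y, z) for the paired atoms
    params   : list of (alpha, beta), one pair per partner (hypothetical values)
    Returns the summed term and its gradient with respect to the query coordinates.
    Assumes a power-law form alpha * d**beta per atom pair.
    """
    shift = 0.0
    grad = [0.0, 0.0, 0.0]
    for xyz, (alpha, beta) in zip(partners, params):
        diff = [q - p for q, p in zip(query, xyz)]
        d = math.sqrt(sum(c * c for c in diff))
        shift += alpha * d ** beta
        # d/dq (alpha * d**beta) = alpha * beta * d**(beta - 2) * (q - p)
        scale = alpha * beta * d ** (beta - 2)
        for k in range(3):
            grad[k] += scale * diff[k]
    return shift, grad

def restraint_force(query, partners, params, shift_exp, k_force=1.0):
    """Cartesian force on the query atom from E = k/2 * (pred - exp)**2."""
    pred, grad = distance_term(query, partners, params)
    d_energy = k_force * (pred - shift_exp)
    return [-d_energy * g for g in grad]

if __name__ == "__main__":
    q = (0.0, 0.0, 0.0)
    partners = [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    params = [(1.0, -1.0), (0.5, -3.0)]
    print(distance_term(q, partners, params))
    print(restraint_force(q, partners, params, shift_exp=0.3))
```

Because the gradient is analytical, the restraint can be evaluated at every integration step alongside the force field, which is the property that the non-differentiable SPARTA and SHIFTX scores lack.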
As stated by Anfinsen’s postulate, the sequence of amino acids in a protein contains sufficient information to determine its fold [ 100 ]. By extension, two or more evolutionarily related (or homologous) proteins that share considerable amino acid sequence similarity likely also have comparable 3D structural features. Classical methods, including MODELLER [ 101 ], I-TASSER [ 55 ] and threading protocols in Rosetta [ 102 ], have achieved success in performing homology (also referred to as comparative or template-based) modeling even when the similarity between the query and template sequences is low (up to 20–25%) [ 103 ]. Nonetheless, these methods generally require a high (>40%) degree of sequence similarity to the template to obtain reliable models for larger proteins. Alternative approaches combine information drawn from evolutionarily related proteins (or templates) with sparse experimental data to overcome the drawbacks of classical methods. In particular, NMR chemical shifts can supplement sequence information to guide template identification and alignment at lower sequence similarity levels, thereby helping to alleviate a problem that has plagued comparative modeling since its inception [ 104 ]. There are now robust approaches that have employed this concept, each unique in the way it selects templates and extracts restraints in order to model the structures of larger proteins with high accuracy.
The first method which combined comparative modeling algorithms with backbone and 13 C β chemical shifts to derive consistently accurate models of protein structures is CS-HM-Rosetta (Chemical Shift-Homology Modeling-Rosetta) [ 104 ]. In this approach, classical CS-Rosetta ab initio calculations are used together with evolutionary distance restraints [ 105 ] derived from homologous proteins (or templates with ~30% sequence identity) in the PDB to bias the search towards solutions that are consistent with both (i) the fold in template structure(s) and (ii) the backbone chemical shifts ( Fig. 7 , blue). In this way, the chemical shift data are used as a means to distinguish high quality alignments from incorrect alignments, both locally and globally. Relative to conventional comparative modeling protocols, this enables accurate template-based modeling in spite of low sequence similarity levels.
Methods for structure determination using restraints derived from evolutionary information and NMR chemical shifts. (Blue) Flow diagram of CS-HM-Rosetta [ 104 ]. In CS-HM-Rosetta, the query sequence is aligned to the sequences of template (evolutionarily related) protein structures extracted from the PDB. A set of distance restraints from the template structures are derived using Gaussian probability densities (silver) for every pair of Cα atoms in the sequence of the query protein. These distance restraints are used along with sparse NMR data (NOEs, RDCs), for chemical shift fragment-based structure determination by CS-Rosetta. (Gray) Flow diagram of CS-RosettaCM/Pomona [ 109 ]. CS-RosettaCM/Pomona obtains chemical shift-derived torsion angles for a query protein using the backbone chemical shifts, and then performs pairwise alignment to the torsion angles and sequence of template structures in the PDB using a dynamic programming algorithm. Following pairwise alignment, possible template structures are selected and clustered. The representative templates are then selected from these clusters. Together with sparse NMR restraints, the filtered structures serve as templates for the CS-RosettaCM protocol. (Dark red) Flow diagram of EC-NMR [ 128 ] and RASREC-Rosetta with evolutionary restraints [ 127 ]. Here, a multiple sequence alignment is constructed using the query sequence and many template sequences with unknown structures. Residues related in space that exhibit covariance in the sequence alignment are then identified using statistical algorithms to derive structural restraints, termed evolutionary coupling (EC) restraints. These EC restraints are combined with NMR data (such as chemical shifts, NOEs, RDCs) and input to standard RASREC-Rosetta or CYANA for structure calculation.
In order to derive long-range structural restraints, CS-HM-Rosetta uses a probabilistic approach to establish a relation between a diverse input of pair-wise alignments to template sequences, with features in the corresponding template structures. Here, (i) all proteins in the PDB are aligned to the query sequence using HHSearch (HMM-HMM search; HMM: Hidden Markov Model) [ 106 ] where the alignment criterion is based on the predicted secondary structure of the query sequence versus the template secondary structure (via DSSP [ 72 ]); the alignment pairs with lower e-values are retained, (ii) every pair of residues that are ten or more positions apart along the query sequence is considered; if the distance between the C α atoms of the corresponding residues in the template structure is within 10 Å, then it is used to compute a multi-basin C α distance constraint, (iii) the joint distribution of distances obtained from all alignments is analyzed against a set of four alignment quality features, including the HHSearch e-value (local sequence similarity) [ 106 ], the BLOSUM62 (Block Substitution Matrix) score [ 107 ] for the aligned residue pairs (global sequence similarity), the nearest gap in a query sequence, and the number of Cβ atoms within 8 Å from other Cβ atoms in the template structure (buried surface). Finally, a multi-modal distribution of distance deviations is constructed for every C α atom pair in the sequence, and subsequently converted into a single distance restraint. Therefore, the confidence of each restraint is strengthened by combining distances computed for the same residue pair from multiple template structures, represented as a mixture of Gaussian distributions [ 105 ] ( Fig. 7 , silver).
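As a rough illustration of how distances from several templates can be merged into one multi-basin restraint, the sketch below scores a trial C α –C α distance by the negative log-likelihood of a Gaussian mixture. The sequence separation (ten residues) and the 10 Å cutoff follow the description above; the per-template weights and widths, and both function names, are hypothetical stand-ins for the alignment-quality features used by the actual protocol.

```python
import math

def collect_components(i, j, template_hits, min_sep=10, max_dist=10.0):
    """Build mixture components for query residue pair (i, j).

    template_hits : list of (ca_distance, weight, sigma), one entry per template
                    alignment covering both residues (weight and sigma are
                    hypothetical proxies for alignment quality).
    Pairs closer than min_sep in sequence, or template distances beyond
    max_dist (in Angstrom), are ignored.
    """
    if abs(i - j) < min_sep:
        return []
    return [(w, dist, sigma) for dist, w, sigma in template_hits if dist <= max_dist]

def mixture_restraint(d, components):
    """Negative log-likelihood of distance d under a Gaussian mixture.

    components : list of (weight, mean, sigma); weights are normalized here.
    """
    total_w = sum(w for w, _, _ in components)
    density = 0.0
    for w, mu, sigma in components:
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        density += (w / total_w) * norm * math.exp(-0.5 * ((d - mu) / sigma) ** 2)
    # Guard against log(0) for distances far from every basin.
    return -math.log(max(density, 1e-300))

if __name__ == "__main__":
    hits = [(6.0, 1.0, 1.0), (9.0, 1.0, 1.0), (14.0, 1.0, 1.0)]
    comps = collect_components(5, 40, hits)
    for d in (6.0, 7.5, 20.0):
        print(d, round(mixture_restraint(d, comps), 3))
```

A distance that matches either template basin scores well, whereas a distance that agrees with none is penalized, which is how several imperfect alignments can jointly reinforce a restraint.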
The distance restraints drawn from evolutionarily related proteins directly influence the convergence and distribution of sampled Rosetta energies in explicit CS-HM-Rosetta calculations. If the input alignments are incorrect either locally or globally, then the derived distance restraints will not be consistent with the experimental chemical shifts, and will yield models with poor convergence and high energies relative to control calculations performed without the use of evolutionary distance restraints [ 108 ]. CS-HM-Rosetta’s ability to model accurate backbone structures (RMSD < 2 Å) and recover a high degree (75–85%) of native sidechain rotamers demonstrates that the combination of NMR chemical shifts with evolutionary distance restraints can circumvent the need to analyze NOE data for targets with remote homologs in the PDB. Instead, the NOE data (if available) can be used for structure validation.
As an alternative to conventional sequence and predicted secondary structure-based alignment methods, the CS-RosettaCM/Pomona (Chemical shift-Rosetta Comparative Modeling/Protein Alignments Obtained by Matching of NMR Assignments) [ 109 ] ( Fig. 7 , gray) protocol relies on the idea that NMR chemical shifts encode local structural homology. The key innovation in this procedure is a protein alignment module, Pomona, which uses TALOS-N [ 110 ] to estimate ɸ/ψ backbone torsion angle probability maps from 13 C α , 13 C β , 13 C γ , 15 N, 1 H α , and 1 H N chemical shifts for every amino acid in the query sequence. These maps are used to compute a substitution score measuring local similarity between the query and a template structure, given by the weighted contributions of backbone torsions, secondary structure and sequence similarity. A pairwise sequence alignment is then performed using a modified version of the Smith-Waterman dynamic programming algorithm [ 111 ], with an objective function that optimizes the substitution score augmented by a gap insertion penalty term [ 112 ]. The resulting alignment is further validated according to the consistency between experimental chemical shifts and SPARTA+ computed chemical shifts (for each residue in the query sequence that aligns to a residue in the database used by SPARTA+). In contrast to classical, sequence-based comparative modeling methods, homologous proteins with sequence identity ≥20% are excluded for the examples used in that study to prevent overfitting. All template structures identified by Pomona undergo normalized C α RMSD-based hierarchical clustering, and the ten top-ranking clusters with respect to alignment score are retained. Finally, the top two representatives from each cluster are used as structural templates for Rosetta’s comparative modeling protocol, RosettaCM [ 113 , 114 ].
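The alignment step can be sketched with the textbook Smith-Waterman recursion and a linear gap penalty. This is not Pomona's modified algorithm: the substitution score below replaces the TALOS-N probability maps with single ɸ/ψ values per residue, and the weights, the gap penalty and the residue dictionary layout are all hypothetical choices made for illustration.

```python
def substitution_score(q, t, w_tors=1.0, w_ss=0.5, w_seq=0.5):
    """Weighted local similarity of a query residue q and a template residue t.

    q, t : dicts with 'phi', 'psi' (degrees), 'ss' (secondary structure code)
           and 'aa' (one-letter residue type). Weights are hypothetical.
    """
    def ang_diff(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    # Torsion term runs from +1 (identical angles) to -1 (both angles off by 180).
    tors = 1.0 - (ang_diff(q["phi"], t["phi"]) + ang_diff(q["psi"], t["psi"])) / 180.0
    ss = 1.0 if q["ss"] == t["ss"] else -1.0
    seq = 1.0 if q["aa"] == t["aa"] else -1.0
    return w_tors * tors + w_ss * ss + w_seq * seq

def smith_waterman(query, template, score=substitution_score, gap=1.0):
    """Standard local alignment; returns (best score, aligned index pairs)."""
    n, m = len(query), len(template)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(query[i - 1], template[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local alignment score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if H[i][j] == H[i - 1][j - 1] + score(query[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]

if __name__ == "__main__":
    helix = lambda aa: {"phi": -60.0, "psi": -45.0, "ss": "H", "aa": aa}
    strand = lambda aa: {"phi": -120.0, "psi": 130.0, "ss": "E", "aa": aa}
    query = [helix("A"), helix("L"), helix("K")]
    template = [strand("V"), helix("A"), helix("L"), helix("K")]
    print(smith_waterman(query, template))
```

Because torsion and secondary structure terms enter the score, a template segment can align well even when its sequence identity to the query is low, which is the point of using chemical shift-derived information in the alignment.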
More recently, the use of sequence covariance information to infer structural relationships between different pairs of residues along the query sequence has shown great promise for enabling reliable fold identification [ 115 ]. Stemming from the principle that evolutionary coupling correlates well with structural proximity, a growing body of work combines evolutionary data with sparse experimental restraints towards accurate modeling of protein structures [ 115 – 117 ]. Moreover, a global effort towards inferring a reliable network of EC (evolutionary coupling) restraints from fewer homologous sequences has improved the effectiveness of this approach [ 115 , 116 , 118 – 124 ]. These methods typically rely on global statistical methods, such as pseudo-likelihood maximization (PLM) [ 122 ] and/or direct information (DI) [ 125 ], and more recently deep learning methods [ 126 ] to identify relevant sequence features for robust identification of residue contacts. As a general rule, these methods first perform a HMM profile-based multiple sequence alignment (MSA) of the evolutionarily related protein sequences. Following MSA, a covariance matrix between all pairs of residues in the query sequence is created. The inverse of the covariance matrix provides conditional mutual information, which allows estimation of residue-residue contacts. Even though these methods exhibit high accuracy in predicting true structural contacts (>80% true positive rate among the top 50 predictions [ 122 , 125 ]), they also have a high false positive rate. Therefore, the extent to which such heterogeneous sets of restraints can be used to guide protein modeling calculations depends on the use of advanced sampling protocols, along with experimental data which can in principle distinguish correct from incorrect EC restraints on the basis of the calculation outcome.
The incorporation of EC restraints together with NMR chemical shifts within robust sampling protocols shows great promise towards identifying the native folds of larger proteins. As described earlier, RASREC-Rosetta has a high degree of accuracy and precision in modeling protein structures with complex folds, in the face of sparse experimental data and erroneous restraints. More recently, RASREC-Rosetta was extended to employ evolutionary contacts in addition to NMR chemical shifts and available sparse experimental data ( Fig. 7 , dark red) [ 127 ]. In this approach, restraints from evolutionary couplings are obtained using either the PLM or DI scoring methods in EVFold (Evolutionary Fold) [ 115 , 116 ]. NMR chemical shifts complement the EC restraints, by identifying a consistent network of restraints during RASREC-Rosetta calculations, and thus eliminating any structurally unrelated correlations recognized by EVFold. In addition, the energy function in RASREC-Rosetta is further adjusted to account for incorrectly drawn EC restraints [ 127 ].
An alternative, more integrative approach, EC-NMR (Evolutionary Coupling-NMR spectroscopy), combines evolutionary contact information with NMR data within the structure determination program, CYANA ( Fig. 7 , dark red) [ 128 ]. In this approach, EC restraints are inferred from the analysis of MSAs using the jackhmmer algorithm [ 129 ]. NMR data, including backbone and side-chain chemical shifts, NOESY peak lists and RDCs are recorded for ILV-methyl labeled protein samples [ 130 – 132 ]. Briefly, the EC restraints are combined with the previously assigned backbone and sidechain NMR chemical shifts, and used to assign the NOESY cross-peaks using the ASDP program [ 133 ]; these assigned restraints are then used in the full simulated annealing structure determination protocol in CYANA [ 95 ]. Here, the correct EC restraints and unambiguous NOESY assignments form a reliable network of contacts which helps in resolving ambiguities in the remaining NOESY assignments and in eliminating possible false positive EC restraints. Finally, the full set of assigned NMR restraints and evolutionary couplings are used to refine the preliminary CYANA models using Rosetta’s all-atom energy function [ 134 ].
Protein complexes constitute over 50% of the proteome and participate in many important biological processes [ 135 ]. NMR has enabled structural studies of such systems in vitro [ 136 ]; however, full structure elucidation is challenging due to their large size and the presence of dynamics, the effects of which become more pronounced at the interface between different subunits and ultimately lead to exchange-induced line broadening of the NMR resonances [ 137 ]. An existing method, HADDOCK (High Ambiguity Driven Docking) (reviewed in [ 138 ]), makes use of chemical shift perturbations to model the structures of protein complexes [ 139 – 142 ]. Specifically, the differences in backbone and sidechain chemical shifts upon complex formation are used to derive ambiguous distance restraints [ 143 , 144 ], which further guide the docking of monomeric subunits under the assumption that the changes are localized to the binding surface(s). As with other semi-flexible docking methods, a major challenge remains in addressing any conformational changes that occur upon complex formation, for instance in domain-swapped protein assemblies and systems with more complex topologies.
RosettaOligomers leverages chemical shift fragments within Rosetta’s docking protocols to perform de novo modeling of symmetric oligomers [ 137 ]. This approach relies on CS-Rosetta to generate structures of monomeric subunits from sequence information, NMR chemical shifts and sparse NOE restraints (if available). In one branch of the protocol, the oligomers being modeled are assumed to contain relatively simple interfaces in which the monomers do not entwine significantly, and therefore the predicted subunits can be used in their free states. All models in the low-energy ensemble computed for the monomeric subunits are then docked together using sparse RDC restraints and user-defined symmetry information [ 145 ] ( Fig. 8 , Pathway 1). Another branch of this protocol makes use of more elaborate (and computationally demanding) fold and dock calculations to address cases of domain-swapped, or interleaved oligomeric proteins [ 146 – 149 ]. These cases can be diagnosed on the basis of the initial CS-Rosetta calculations performed for the monomeric subunits: if the resulting structural models exhibit divergence (>3 Å) after ab initio folding, then the oligomeric complex is likely to be interleaved ( Fig. 8 , Pathway 2). Although this method was developed originally to model symmetric domains, it can be extended to accommodate asymmetric domains [ 149 , 150 ]. RosettaOligomers was recently integrated with RASREC-Rosetta. This extended protocol uses NMR chemical shifts, RDCs and SAXS (Small-angle X-ray scattering) data to model larger complexes, as was demonstrated for a 33 kDa dimer target [ 151 ].
Structure determination of protein complexes with RosettaOligomers guided by chemical shifts. Flow diagram for modeling protein complexes with RosettaOligomers [ 137 ] using chemical shifts, sparse NMR restraints derived from NOEs, RDCs, SAXS data sets, and user-specified symmetry definition. Pathway 1, designed to address oligomers from independently folded monomers (PDB ID 1C77, left): (i) CS-Rosetta produces a structural ensemble for each monomeric subunit and (ii) monomers are docked using protein–protein docking protocols in Rosetta [ 145 ]. Pathway 2, designed to address domain-swapped oligomers (PDB ID 2K5J, right): (i) chemical shift-derived backbone fragments, together with sparse NMR restraints, are used in one step with the fold and dock protocol [ 149 ]. Both approaches can be used in either a fully symmetric mode, using Rosetta’s symmetry interface [ 249 ], or an asymmetric mode.
CamDock performs ab initio modeling of protein complexes using the Chord program [ 152 ] which is based on the HEX [ 153 ] approach in its use of a spherical harmonics-based representation of protein surfaces. Here, backbone chemical shifts are used together with CHESHIRE’s molecular dynamics refinement strategy, as described in Section 4. CamDock was used to model E9-Im9, a 60 kDa protein complex, which resulted in a structural ensemble that is very close (1.18 Å C α RMSD) to the reference X-ray structure [ 152 ]. In a more recent work focusing on a Ztaq: Anti-ZTaq protein complex containing 144 amino acids in total [ 154 ], the CHESHIRE procedure was first applied to model the monomeric subunits in their bound states, which were then docked as rigid bodies using a protocol akin to CamDock. The docked protein complex is further optimized by CHESHIRE’s hybrid (MD/Monte Carlo) refinement protocol using an objective function that captures experimental and predicted chemical shifts together with molecular mechanical force fields (see Section 4). As a result of these key innovations, the combined CHESHIRE/CamDock approach generated structural models within 1 Å (backbone RMSD) from the reference X-ray structure [ 154 ].
Chemical shifts also constitute unique NMR observables in modeling the structures of biologically relevant, sparsely populated transient protein and nucleic acid conformations (termed ‘dark’, ‘invisible’ or ‘excited’ states) [ 155 , 156 ]. Such measurements are made possible by the development of a suite of NMR experiments to probe excited states with lifetimes in the µs–ms (microsecond-millisecond) timescale. In particular, PRE/PCS measurements [ 157 ] are useful for cases of fast conformational exchange; rotating frame R 1ρ relaxation [ 158 ] and Carr–Purcell–Meiboom–Gill (CPMG) dispersion [ 159 ] for intermediate exchange; and chemical-exchange saturation transfer (CEST) [ 160 ] for slow exchange. The power of these methods is highlighted in key applications for the FF domain of human HYPA/FBP11 [ 161 ], Fyn SH3 domain [ 162 ], T4 lysozyme [ 163 ], HIV (human immunodeficiency virus)-1 transactivation response element RNA (Ribonucleic acid) [ 164 ], ubiquitin [ 165 ], Ca 2+ sensor signaling protein calmodulin [ 166 ], a transcriptional riboswitch [ 167 ], and E. coli enzyme dihydrofolate reductase [ 168 ]. These examples assume a system in which conformational exchange occurs between two states. Whereas multi-state exchange models have been explored by several research groups, they are usually limited to three states to avoid overfitting of the NMR data [ 156 ].
Recently, the integration of chemical shifts derived from the fitting of relaxation dispersion data with methods such as CS-Rosetta has enabled modeling of sparsely populated protein conformations. In these studies, typically, a series of CPMG dispersion experiments recorded at multiple magnetic fields and temperatures provide insights into the excited states by fitting populations, chemical shift differences ( Δω ), and exchange rates ( k ex ) for the major (ground-state) and minor (excited-state) conformations [ 169 ]. CS-Rosetta has been employed together with backbone 1 H, 15 N, 13 C chemical shifts and amide RDCs to model the excited-state conformations of folding intermediates of either (i) a T4 lysozyme mutant [ 163 ] or (ii) a HYPA/FBP11 FF domain [ 161 , 170 ]. Alternatively, excited-state structures can be elucidated using paramagnetic NMR restraints provided by PRE and PCS (pseudocontact shift) measurements [ 156 ]. Specifically, PCS restraints have been used for structure determination of a transient thioester intermediate formed between Staphylococcus aureus sortase A (SrtA) and a substrate peptide, which was inaccessible to traditional structure determination methods due to its short lifetime [ 171 ]. In that study, structural restraints for the SrtA-peptide intermediate were acquired by labeling SrtA with paramagnetic lanthanide tags which enabled the detection of 407 PCS restraints used for structure calculation in Xplor-NIH [ 172 ].
Despite undeniable progress in chemical-shift-driven structure determination, the majority of studies have focused on information extracted from backbone chemical shifts, often resulting in lower resolution with respect to the orientation of sidechain groups. High resolution modeling of sidechain conformations is of considerable interest towards understanding protein function with respect to enzyme catalysis [ 173 ], protein interaction modes [ 174 ] and folding [ 175 ]. In the absence of sidechain chemical shift measurements, the most probable orientations of sidechains can be inferred from their lowest-energy conformations sampled from existing rotamer libraries [ 176 – 180 ]. Many ab initio structure determination approaches utilize such rotamer libraries to model static sidechain conformations using Monte Carlo-based optimization. Similar to protein backbone modeling using experimental and predicted chemical shifts, modeling sidechain conformations can be significantly improved by the use of 13 C chemical shifts. Towards this end, chemical shift prediction methods, such as CH3SHIFT [ 181 ], can help guide the rotamer selection and structure refinement processes. In practice, the utility of such methods is limited due to the difficulty of predicting the γ-gauche effect, where the 13 C chemical shift of a given nucleus is influenced by its position relative to γ -substituents [ 182 ], along with the observation that sidechain conformations may be constrained in X-ray structures relative to a solution environment.
In solution, sidechain rotamers sample an ensemble of functionally relevant states, which can be unveiled, in principle, by a full analysis of NMR chemical shifts. Here, methyl groups, typically found at the hydrophobic core of proteins, have favorable relaxation properties and their resonances are useful when studying larger, more complex systems [ 174 ]. While stereospecific characterization of methyl groups is generally difficult to achieve using uniformly labeled samples, the use of stereospecific isotopic methyl labeling schemes (employing precursors that lead to pro-R and pro-S labeled leucine or valine residues [ 183 ]) can help distinguish these groups, even for larger targets. This can in turn aid in capturing different rotamer configurations for leucines and valines [ 184 ]. For example, determination of sidechain rotameric states for leucine Cδ1/Cδ2 groups, which can sample trans, gauche + or gauche − conformations, can be performed using measurements of chemical shift differences between stereospecifically assigned methyl groups (ΔCδ 12 = δ ( 13 C δ1 ) – δ ( 13 C δ2 )) [ 185 ] or empirical 3 J CC ( 13 CH 3 – C α ) scalar bond couplings [ 186 , 187 ]. The former was demonstrated through a clear correlation between 13 C sidechain chemical shifts, χ 1 /χ 2 dihedral angles and rotamer conformations observed in high resolution structures [ 185 ]. In addition, a linear combination of ΔCδ 12 and empirical 3 J CC scalar coupling values proved useful for the interpretation of more dynamic leucine rotamer populations for calbindin D9k [ 187 ]. This analysis was facilitated by the fact that the leucine χ 2 dihedral angle primarily samples trans and gauche + conformations in solution [ 188 ]. Similarly, isoleucine sidechain χ 2 rotamer conformations can be determined from chemical shifts [ 189 ] or J-coupling [ 190 ] measurements.
Although isoleucine χ 2 rotamers can sample all four (trans, gauche + , gauche − , gauche 100 ) distinct conformations, based on analysis of high resolution X-ray structures, only the trans and gauche − conformers are populated in solution [ 189 ]. A similar approach has been applied for elucidating the χ 3 rotamer of the methionine Cε methyl group [ 191 ]. The situation is more complicated for valine because its sidechain χ 1 can sample multiple rotamer states (trans, gauche + or gauche − ) in solution. Here, each valine χ 1 rotamer is derived from fitting measured 13 C γ1 / 13 C γ2 chemical shifts to a set of 20 χ 1 dihedral angles, allowing for accurate estimation of trans, gauche + and gauche − rotamer populations [ 192 ].
In practice, these approaches have been applied to accurately determine methyl sidechain conformations in sparsely populated excited states through the measurement of chemical shifts via CPMG relaxation dispersion experiments [ 188 , 189 ]. Measurement of methyl 13 C chemical shifts has also shown success in solid-state NMR studies [ 193 ]. Finally, sidechain modeling has primarily assumed a single rotameric state rather than a distribution of states. A major step towards addressing this problem has been the implementation of a curated database consisting of extensive dynamic sidechain rotamers sampled from the MD simulations of known protein structures [ 194 ]. However, incorporating the information content of dynamic sidechains requires computing long MD simulation trajectories, which can be limiting for larger systems.
While the use of chemical shifts in structural studies of nucleic acids is a fairly mature field (reviewed in [ 194 ]), the application of shifts towards full structure determination is relatively new [ 196 ]. Unlike for proteins, the relation between NMR chemical shifts and the corresponding nucleic acid structures is difficult to discern due to the narrow dispersion of shift values and the limited availability of assigned chemical shifts in the BMRB, both of which have stymied the development of automated methods [ 195 , 197 ]. Nevertheless, 1 H chemical shifts still provide powerful restraints that can distinguish native from non-native nucleic acid conformations [ 198 ]. Furthermore, approaches have been developed to assign [ 199 ] or predict [ 200 – 203 ] chemical shifts to aid nucleic acid structure determination, refinement and validation.
De novo modeling of RNA structures containing non-canonical regions is challenging and several research groups have attempted this problem with limited success [ 204 – 207 ]. More recently, two complementary approaches, FARFAR (Fragment Assembly of RNA with Full Atom Refinement) [ 208 ] and SWA (Stepwise Assembly) [ 209 ] have exhibited favorable outcomes in modeling non-canonical regions using realistic force fields. In particular, FARFAR employs a fragment assembly approach [ 204 ] to model low resolution structures, followed by full-atom refinement using a high resolution energy function. Alternatively, SWA builds each nucleotide in a stepwise manner recursively, where each step involves exhaustive enumeration of all possible conformations of the new residue. Although both these methods address conformational sampling bottlenecks that typically arise during RNA structure modeling, for more complicated cases (such as the UUAAGU hexaloop from 16S ribosomal RNA), the energy function does not provide sufficient discrimination of the native state [ 210 ]. These results spurred the development of CS-Rosetta-RNA (Chemical Shift-Rosetta-Ribonucleic Acid), where 1 H chemical shifts are exploited to perform de novo modeling of RNA structures ( Fig. 9 ) [ 210 ]. In this method, FARFAR and SWA are used in parallel to sample a large number of plausible RNA conformations. The resulting RNA structural models are energy minimized and ranked using an adjusted all-atom energy function according to Eq. (4) :
Here, E Rosetta is the standard all-atom energy function in Rosetta [ 208 ] without using chemical shifts, δ i exp and δ i calc are the experimental and back-calculated non-exchangeable proton chemical shifts, the latter obtained using NUCHEMICS [ 200 , 202 ], and c is a weight factor. As shown by the results on a benchmark set of 23 targets, CS-Rosetta-RNA successfully demonstrates that the introduction of chemical shift-based terms in the high resolution potential drives the simulations towards a global, native-like minimum in the energy landscape. These approaches hold great promise for the modeling of protein/RNA complexes in future methods.
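The ranking step can be illustrated with a minimal sketch in which the Rosetta energy of each model is augmented by a weighted chemical shift penalty. The sum-of-squared-deviations form used here is an assumption that is consistent with the symbols defined above (E Rosetta , δ i exp , δ i calc and c); the function names, the model dictionary layout and the example values are hypothetical.

```python
def shift_adjusted_energy(e_rosetta, shifts_exp, shifts_calc, c=1.0):
    """E = E_Rosetta + c * sum_i (delta_i_exp - delta_i_calc)**2 (assumed form).

    shifts_exp, shifts_calc : dicts mapping a proton label to a shift in ppm.
    Only protons present in both dictionaries contribute to the penalty.
    """
    penalty = sum((shifts_exp[k] - shifts_calc[k]) ** 2
                  for k in shifts_exp if k in shifts_calc)
    return e_rosetta + c * penalty

def rank_models(models, shifts_exp, c=1.0):
    """Sort models (dicts with 'name', 'energy', 'shifts') by adjusted energy."""
    return sorted(
        models,
        key=lambda m: shift_adjusted_energy(m["energy"], shifts_exp, m["shifts"], c),
    )

if __name__ == "__main__":
    exp = {"U3.H1'": 5.60, "A4.H8": 8.10}
    models = [
        {"name": "A", "energy": -10.0, "shifts": {"U3.H1'": 6.60, "A4.H8": 7.10}},
        {"name": "B", "energy": -9.0, "shifts": {"U3.H1'": 5.60, "A4.H8": 8.10}},
    ]
    print([m["name"] for m in rank_models(models, exp)])
```

In this toy case, model A has the lower Rosetta energy but disagrees with the measured shifts, so the adjusted energy ranks model B first. This mirrors the situation described above, where the energy function alone does not discriminate the native state.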
Modeling RNA structures using CS-Rosetta-RNA. The query RNA sequence is used by specialized Rosetta protocols, FARFAR [ 208 ] and SWA [ 209 ] to construct a large number of plausible RNA structures. The predicted RNA structures are filtered using a combination of standard Rosetta energy function terms together with a penalty function which measures the difference between experimental and back-calculated chemical shift values ( Eq. (4) ).
While chemical shift-based approaches offer an opportunity to determine the structures of proteins de novo , the main driving force of conventional NMR structure determination protocols remains NOE measurements. NOE connectivities form a network of inter-proton distance restraints (typically within 6 Å in the 3D structure) that can be used directly in structure determination. Typically, hundreds to thousands of NOE restraints are required to define backbone and sidechain orientations during the process of structure modeling [ 15 , 16 , 211 ]. However, acquiring such restraints is a labor-intensive activity that involves analyzing and interpreting hundreds of cross-peaks in the NOESY spectra. NOE cross-peaks can be assigned to atom pairs in the protein sequence through the accurate mapping of proton chemical shifts to the cross-peak coordinates. The problem of ambiguity arises rapidly during this mapping process, as a result of spectral overlap. Therefore, automatic NOESY assignment and structure refinement has been an iterative process, which typically relies on highly complete (>90%) and accurately assigned NMR chemical shifts. Here, an initial, self-consistent network [ 212 ] of relatively unambiguous NOE restraints (of the order of 100, depending on target size and degree of spectral overlap) is drawn from the more unique mappings, to generate an initial set of low-resolution structures [ 15 ]. The low-resolution structures from early stages are then used to reduce uncertainty in the remaining unassigned or ambiguously assigned NOE restraints [ 144 , 213 ]. This general concept laid the foundation for the majority of NMR structure determination programs [ 7 , 15 ]. In these approaches, backbone dihedral angle restraints derived from chemical shifts using TALOS and similar methods can play an auxiliary role in biasing the search towards more native-like conformations that successively help assign more long-range NOEs.
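The mapping of cross-peak coordinates to proton chemical shifts, and the way preliminary models reduce the resulting ambiguity, can be sketched as follows. This is a simplified illustration for a two-dimensional peak (real NOESY data sets are typically 3D or 4D); the matching tolerance, atom labels and function names are hypothetical, and the 6 Å cutoff follows the typical NOE range quoted above.

```python
def candidate_assignments(peak, proton_shifts, tol=0.03):
    """List all proton pairs whose shifts match both peak coordinates.

    peak          : (w1, w2) cross-peak position in ppm
    proton_shifts : dict mapping an atom label to its assigned shift in ppm
    tol           : matching tolerance in ppm (hypothetical value)
    More than one returned pair means the peak is ambiguous.
    """
    dim1 = [a for a, s in proton_shifts.items() if abs(s - peak[0]) <= tol]
    dim2 = [a for a, s in proton_shifts.items() if abs(s - peak[1]) <= tol]
    return [(a, b) for a in dim1 for b in dim2 if a != b]

def filter_by_models(candidates, model_distances, cutoff=6.0):
    """Keep candidate pairs that are within the NOE range in preliminary models.

    model_distances : dict mapping frozenset({atom1, atom2}) to a representative
                      distance (in Angstrom) in the low-resolution ensemble.
    Pairs without a recorded distance are treated as out of range.
    """
    return [pair for pair in candidates
            if model_distances.get(frozenset(pair), float("inf")) <= cutoff]

if __name__ == "__main__":
    shifts = {"A10.HN": 8.20, "L25.HN": 8.21, "V40.HG1": 0.85}
    peak = (8.20, 0.85)
    cands = candidate_assignments(peak, shifts)
    print(cands)  # two candidates: overlap of A10.HN and L25.HN
    dists = {
        frozenset(("A10.HN", "V40.HG1")): 4.5,
        frozenset(("L25.HN", "V40.HG1")): 15.0,
    }
    print(filter_by_models(cands, dists))
```

The example shows why better early models matter: the more native-like the preliminary ensemble, the more ambiguous candidates can be discarded or confirmed in each iteration.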
Several excellent reviews discuss the internal workings of many successful NOE assignment and structure refinement approaches [ 7 , 15 ]; here, we focus on fragment-based approaches that offer an opportunity to further explore this concept through a more optimal use of chemical shifts as a means to improve sampling of native-like conformations. For instance, Auto-NOE-Rosetta (Automatic NOESY Assignment-Rosetta) [ 214 ] leverages the powerful RASREC-Rosetta sampling engine (see Section 4) together with an iterative NOE assignment algorithm which uses network anchoring [ 212 ], agreement with a pool of preliminary models (already built into the RASREC algorithm), and presence of symmetry-related cross-peaks. The assigned NOEs are used to derive distance restraints at various ambiguity levels [ 144 ]. Low-confidence, ambiguous restraints can be combined with highly unambiguous restraints [ 144 , 215 ] and used within eight distinct conformational sampling stages by Auto-NOE-Rosetta.
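One common way to turn an ambiguously assigned cross-peak into a single distance restraint is the r⁻⁶ summation over all candidate proton pairs used in ARIA-style ambiguous restraints. The sketch below assumes this form; the text does not state the exact expression evaluated inside Auto-NOE-Rosetta, and the function names and the 5 Å upper bound are illustrative only.

```python
def effective_distance(distances):
    """Combine candidate proton-proton distances (in angstroms) by r^-6 summation."""
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)

def violation(distances, upper_bound):
    """Amount by which the effective distance exceeds the restraint's upper bound."""
    return max(0.0, effective_distance(distances) - upper_bound)

# One close candidate pair dominates: the restraint is satisfied even though
# the second candidate pair is 20 angstroms apart.
print(effective_distance([3.0, 20.0]))
print(violation([3.0, 20.0], upper_bound=5.0))
```

Because the sum is dominated by the shortest candidate distance, an ambiguous restraint is satisfied as soon as any one of its candidate pairs is close in the model, which is what allows low-confidence restraints to be used alongside unambiguous ones without forcing a premature assignment.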
To demonstrate the improved performance of chemical shift fragment-based approaches in NOE-driven structure determination, we performed new structure calculations of the 198 aa α-lytic protease (aLP) protein from sequence information alone using Auto-NOE-Rosetta and compared them with PDB deposited models determined with the help of chemical shifts (PDB ID 5WOT) [ 216 ] ( Fig. 10A and B ). From this comparison, it can be observed that the models obtained using chemical shift fragments have lower energies, exhibit higher convergence and are closer to the X-ray structure (PDB ID 1P01) ( Fig. 10A–C ). Further analysis of NOE assignments in the models for both scenarios revealed, as expected, very low (~30%) recovery of native residue pair contacts for sequence based fragments ( Fig. 10D ) relative to the contacts obtained using chemical shift fragments (~70%). Moreover, Auto-NOE-Rosetta successfully assigned approximately three times more NOE restraints per residue in the structural ensemble calculated using chemical shift fragments as opposed to those obtained using only the sequence fragments ( Fig. 10E ). This result stems from the sampling of native-like structures during the early stages of the protocol, which in turn helps assign more long-range NOE restraints. Hence, our comparison illustrates a high degree of synergy between chemical shifts and NOE structural restraints in driving CS-Rosetta structure calculations.
Synergy between chemical shift-based fragments and automated NOE assignments. ( A ) Ten lowest-energy structures of the 198 aa α-lytic protease protein computed by AutoNOE-Rosetta from fragments derived using the protein amino acid sequence together with manually assigned NMR backbone chemical shifts (PDB ID 5WOT) [ 216 ]. ( B ) Ten lowest-energy structures of the same protein computed by AutoNOE-Rosetta using fragments derived solely from the amino acid sequence. ( C ) Energy (in R.E.U., Rosetta Energy Units), RMSD (in Å) to the X-ray structure (PDB ID 1P01) and convergence (in %) of the ten lowest-energy structures computed by AutoNOE-Rosetta with (purple) and without (red) chemical shifts (gray arrows). Convergence of the structures is shown in the gradient scale to the right. ( D ) NOE contacts are defined as a function of residue pairs. The upper triangular region represents long-range (at least 5 residues apart) NOE contacts identified by AutoNOE-Rosetta in two independent calculations: the first performed using structural fragments derived from the protein amino acid sequence (red), and the second using chemical shift fragments (silver). The lower triangular region represents long-range NOE contacts predicted between all possible protons in the X-ray structure (PDB ID 1P01), using a 5.5 Å distance threshold (green). ( E ) Number of NOE restraints assigned by AutoNOE-Rosetta for each residue in the ten lowest-energy structural models computed with (silver) and without (red) chemical shift-based fragments. Data obtained from [ 216 ].
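The contact recovery reported in panel D can be sketched as follows, using the definitions given in the caption (proton pairs within 5.5 Å, at least 5 residues apart). The coordinates and the assigned contact list are invented for illustration, and the function names are hypothetical; this is not the analysis script used in [ 216 ].

```python
import math

def long_range_contacts(protons, cutoff=5.5, min_sep=5):
    """Residue pairs with at least one proton pair within cutoff (angstroms)."""
    contacts = set()
    for i, (res_i, xyz_i) in enumerate(protons):
        for res_j, xyz_j in protons[i + 1:]:
            far_in_sequence = abs(res_i - res_j) >= min_sep
            if far_in_sequence and math.dist(xyz_i, xyz_j) <= cutoff:
                contacts.add((min(res_i, res_j), max(res_i, res_j)))
    return contacts

def recovery(assigned, reference):
    """Fraction of reference contacts that also appear among the assigned ones."""
    return len(assigned & reference) / len(reference) if reference else 0.0

# Toy "X-ray" protons as (residue number, coordinates in angstroms).
protons = [(1, (0.0, 0.0, 0.0)), (2, (1.0, 0.0, 0.0)), (10, (3.0, 0.0, 0.0)),
           (20, (30.0, 0.0, 0.0)), (25, (33.0, 0.0, 0.0))]
reference = long_range_contacts(protons)

# Toy set of residue pairs connected by assigned NOEs.
assigned = {(1, 10), (20, 25), (3, 40)}
print(sorted(reference), recovery(assigned, reference))
```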
The accuracy of all chemical shift-based structure modeling methods addressed so far depends, to a large extent, on both the correctness and completeness of the input chemical shift assignments. In recent years, there has been a surge of development in methods for automated chemical shift resonance assignments of both backbone and sidechain atoms, often with very high levels of accuracy (reviewed in [ 16 , 217 ]). Such algorithms have become integral components of structure calculation protocols employing NOEs and/or RDCs. While the majority of chemical shift assignment algorithms operate on the basis of a large number (6–10) of complementary NMR spectra, an effort towards reducing the number of input spectra is driven by the need to simplify and further automate the NMR structure determination process.
With this in mind, 4D-CHAINS, an automated procedure, was developed recently to assign backbone and sidechain chemical shifts using two complementary 4D (TOCSY and NOESY) spectra, recorded in fully protonated samples [ 216 ]. 4D-CHAINS uses 2D probability density maps of correlated 13 C– 1 H chemical shifts to identify spin systems (termed amino acid index groups or AAIGs) in the input 4D data. Thereafter, AAIGs are mapped to amino acids in the query protein sequence using a procedure similar to the genome assembly used in DNA sequencing [ 218 ]. During this process, contiguous segments of AAIGs are iteratively matched along the protein sequence until a self-consistent assignment solution is obtained. The high levels of accuracy and completeness of 4D-CHAINS ( Fig. 11A ) allow it to be combined with NOE assignment and structure determination algorithms, such as AutoNOE-Rosetta. The practical utility of the combined 4D-CHAINS/AutoNOE-Rosetta protocol was demonstrated recently through the structure calculation of aLP ( Fig. 11B ) [ 216 ]. Here, 4D-CHAINS assigned chemical shifts together with two unassigned NOESY peak lists were provided as input to AutoNOE-Rosetta. The combined protocol (i) generated structural models within 1.3 Å from the reference X-ray structure (PDB ID 1P01) ( Fig. 11B ) and (ii) captured two-thirds of the crystallographic NOE contacts across the entire protein, suggesting good recovery of near-native folds ( Fig. 11B–D ) [ 216 ]. Together, the 4D-CHAINS/AutoNOE-Rosetta approach forms a complete, automated pipeline for NMR structure determination from a minimal set of spectra.
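The idea of matching a contiguous segment of AAIGs along the protein sequence can be conveyed with a deliberately simplified sketch. It assumes that each AAIG already carries residue-type probabilities (here invented numbers standing in for values read from the 13 C– 1 H probability density maps) and scores every placement by the product of those probabilities. This is a toy scoring scheme chosen for illustration, not the 4D-CHAINS algorithm itself.

```python
def place_chain(chain_probs, sequence):
    """Slide a chain of AAIGs along the sequence; return (best start, best score)."""
    best = None
    for start in range(len(sequence) - len(chain_probs) + 1):
        score = 1.0
        for offset, probs in enumerate(chain_probs):
            # Probability that this AAIG is the residue type found at this position.
            score *= probs.get(sequence[start + offset], 0.0)
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical query sequence and a chain of three sequentially connected AAIGs.
sequence = "MAGVLA"
chain = [{"G": 0.8, "A": 0.2}, {"V": 0.7, "I": 0.3}, {"L": 0.9}]

start, score = place_chain(chain, sequence)
print(start, sequence[start:start + len(chain)], score)
```

Longer chains are matched far more selectively than single spin systems, which is the same reason overlapping reads can be placed unambiguously in sequence assembly.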
Automated chemical shift assignment and structure determination of α-lytic protease using 4D-CHAINS/AutoNOE-Rosetta. The use of the 4D-CHAINS/AutoNOE-Rosetta pipeline [ 216 ] is illustrated for a 20 kDa, uniformly 13 C, 15 N-labeled protein with a highly complex β-fold topology. ( A ) 4D-CHAINS produces reliable assignments at completeness levels (~93%) which exceed the minimum required (~70%) by AutoNOE-Rosetta to converge on the correct fold using simulated peak lists. First, 4D-CHAINS assigns 77% of all observed backbone and sidechain chemical shifts using a 4D HC(CC–TOCSY(CO))NH experiment (dark green). Second, correct assignments are automatically extended by an additional 13% using common NOEs in a 4D 13 C, 15 N-edited HMQC-NOESY-HSQC (Heteronuclear Single Quantum Coherence) experiment (light green). The full method has a combined 1.9% error rate (red), and does not consider the resonances of aromatic or sidechain amide groups (silver), which can be readily obtained manually using the automated assignments as a guide. ( B ) Ensemble of ten lowest-energy structures calculated using AutoNOE-Rosetta, superimposed on the X-ray reference structure (PDB ID 1P01). Average RMSD to X-ray: 1.3 Å (computed for backbone atoms over core secondary structure regions). ( C ) NOE contacts defined for residue pairs along the sequence of α-lytic protease. The upper triangular region represents NOE contacts identified by AutoNOE-Rosetta using chemical shifts assigned by 4D-CHAINS and two complementary 4D NOESY peak lists (HCNH and HCCH). The lower triangular region represents all degenerate 1 H NOE contacts predicted from the X-ray structure using a 5.5 Å distance threshold. ( D ) Comparison of the total number of NOE contacts between amide-amide, amide-aliphatic and aliphatic-aliphatic protons assigned by AutoNOE-Rosetta and NOEs predicted from the X-ray structure as described in ( C ). All structure diagrams were prepared using PYMOL ( https://pymol.org/2/ ).
Classical NOE-based approaches for NMR structure determination rely on the analysis of short- to medium-range (<6 Å) 1 H– 1 H distance restraints [ 219 ]. These local NOE connectivities are typically complemented with more global restraints obtained from measurements of RDCs, PREs and PCSs [ 220 , 221 ] during the final stages of structure refinement. More recently, the use of such “global” restraints, in conjunction with chemical shift fragment-based approaches, has proven to be a powerful combination to alleviate or reduce the requirement of NOEs for modeling protein structures. Here, we briefly discuss the utility of RDC-, PRE-, and PCS-derived restraints in chemical shift-based structure determination.
RDC measurements report on global orientations between inter-nuclear bond vectors with respect to an overall alignment frame [ 31 , 220 , 222 ]. RDCs are highly sensitive structural parameters; therefore, their application during structure refinement and validation can help not only to identify the overall protein fold, but also to pinpoint detailed structural features, such as the precise equilibrium length of bonds [ 223 , 224 ] or deviations from planarity in the peptide group [ 225 ]. However, these high-resolution applications of RDCs are limited to smaller proteins. Normally, RDC restraints have been employed within de novo structure determination protocols of various levels of complexity [ 226 ]. Due to the degeneracy of RDC values with respect to the underlying orientation of inter-nuclear vectors, multiple independent datasets recorded using different alignment media are required in order to define a uniquely preferred orientation [ 226 ]. More recently, significant progress has been made in the development of automated structure determination approaches guided by chemical shifts and/or sparse RDC restraints [ 7 , 226 ]. In all these methods, RDCs offer a highly complementary source of structural information to the backbone chemical shifts; while chemical shifts are very sensitive to the local backbone structure, RDCs help define long-range structural features, particularly the orientation of different secondary structural elements and individual domains within the structures of multi-domain proteins. This was recently demonstrated through RASREC-Rosetta calculations, where the use of amide RDCs in conjunction with backbone chemical shifts and sparse amide NOEs enabled structure determination of targets up to 25 kDa [ 81 ]. Finally, RDCs together with chemical shifts offer an opportunity for self-consistent cross-validation of NMR structures [ 227 ], which becomes particularly relevant in the face of sparse datasets.
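The orientational degeneracy mentioned above follows directly from the standard expression for an RDC in the principal frame of the alignment tensor, D = D a [(3cos²θ − 1) + (3/2) R sin²θ cos 2φ]. The minimal sketch below evaluates this expression for a bond vector; the values D a = 10 Hz and R = 0.3 and the example vector are hypothetical, and the function name is an illustration rather than part of any package discussed here.

```python
import math

def rdc(vector, da, rhombicity):
    """RDC (Hz) for a bond vector given in the alignment tensor's principal frame."""
    x, y, z = vector
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    # For a unit vector: cos^2(theta) = z^2 and sin^2(theta) * cos(2 phi) = x^2 - y^2.
    return da * ((3 * z * z - 1) + 1.5 * rhombicity * (x * x - y * y))

da, rhombicity = 10.0, 0.3      # hypothetical alignment parameters
bond = (0.3, -0.5, 0.8)         # hypothetical N-H bond direction

# Inverting the vector, or mirroring any single component, leaves the RDC unchanged.
for v in (bond, (-0.3, 0.5, -0.8), (-0.3, -0.5, 0.8), (0.3, 0.5, 0.8)):
    print(v, round(rdc(v, da, rhombicity), 6))
```

Because only squared components enter the expression, a single dataset cannot distinguish these orientations (or, more generally, any orientation on the same RDC contour), which is why data from a second, differently oriented alignment tensor are needed.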
PRE restraints are obtained through a quantitative analysis of 15 N and 13 C relaxation rates in samples containing paramagnetic tags, typically attached via site-specific labeling approaches, relative to a diamagnetic reference sample. Here, the conjugation of nitroxide spin labels to engineered disulfides in proteins has been particularly useful. Alternatively, solution PREs have been widely adopted for structure modeling applications allowing for de novo structure determination of large proteins (40–100 kDa) in the absence of abundant long-range NOE restraints [ 221 , 228 , 229 ]. For instance, solvent (s)PRE-CS-Rosetta [ 229 ] makes use of the global fold information encoded within sPRE restraints (i.e. distance measurements between a paramagnetic solute and the protein surface) and chemical shifts to model protein structures. In the sPRE-CS-Rosetta protocol, amino acid sequences together with NMR chemical shifts are used to generate backbone fragments, which are subsequently assembled to produce low-resolution structural models (see Section 4). These low-resolution models are further used to back-calculate the sPRE effect for comparison against experimental data, and additionally to compute the sPRE-based score which is used to adjust the energy function. This approach leverages a fast, grid-based method for sPRE computation during the low-resolution stage of ab initio fragment assembly. Thus, the use of sPRE restraints complements chemical shift-based fragments by biasing the collapse of the polypeptide chain towards more native-like conformations [ 229 ].
PCS measurements provide structural restraints derived by measuring changes in chemical shift values due to the presence of a paramagnetic metal ion [ 230 ]. In contrast to NOEs and PREs, which show an r⁻⁶ dependence, PCSs display an r⁻³ distance dependence, allowing for comparatively longer distance measurements between atoms (up to 40 Å) [ 221 ], in conjunction with their orientational dependence that can provide a further powerful source of structural discrimination. Therefore, PCSs are used by protein structure determination algorithms to obtain global fold information [ 231 – 234 ]. While PCSs are extensively utilized during docking or structure refinement, their use is limited during de novo modeling because the tensor parameters used to calculate PCS distance restraints depend on atomic coordinates. PCS-Rosetta extends Rosetta’s ab initio algorithm and makes use of chemical shift-derived fragments together with a low-resolution energy function adjusted according to the PCS score (computed using experimental PCS data) [ 232 ]. The PCS-based score term is obtained by interleaving a grid search, which defines the position of the paramagnetic tag, with a singular value decomposition to fit the five tensor parameters. Following the low-resolution stage, sidechains are introduced and refined using a full-atom energy function augmented by the PCS score. This protocol has been recently expanded to include paramagnetic tags located at multiple sites with the aim of enabling more robust structure determination of smaller proteins [ 233 ].
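The interleaving of a grid search with a linear fit can be sketched in a few lines. Once the metal position is fixed, the PCS of each nucleus is linear in the five independent elements of the traceless, symmetric Δχ tensor, δ = [χ xx (x² − z²) + χ yy (y² − z²) + 2χ xy xy + 2χ xz xz + 2χ yz yz] / (4πr⁵), so the tensor follows from linear least squares. The sketch below is not the PCS-Rosetta implementation: it solves the normal equations by Gauss-Jordan elimination as a standard-library stand-in for the singular value decomposition, works in arbitrary units, and uses synthetic coordinates, a synthetic tensor and a four-point grid invented for illustration.

```python
import math
import random

def design_row(atom, metal):
    """Coefficients that map the five tensor elements to one PCS value."""
    x, y, z = (a - m for a, m in zip(atom, metal))
    r5 = math.sqrt(x * x + y * y + z * z) ** 5
    scale = 1.0 / (4.0 * math.pi * r5)
    return [scale * c for c in (x * x - z * z, y * y - z * z,
                                2 * x * y, 2 * x * z, 2 * y * z)]

def least_squares(rows, values):
    """Solve the normal equations by Gauss-Jordan elimination (stand-in for SVD)."""
    n = len(rows[0])
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * v for r, v in zip(rows, values))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_tensor(atoms, pcs, metal):
    """Best-fit tensor elements and squared residual for one trial metal position."""
    rows = [design_row(atom, metal) for atom in atoms]
    tensor = least_squares(rows, pcs)
    residual = sum((sum(c * t for c, t in zip(row, tensor)) - obs) ** 2
                   for row, obs in zip(rows, pcs))
    return tensor, residual

def grid_search(atoms, pcs, grid):
    """Return (metal, tensor, residual) for the grid point with the lowest residual."""
    return min(((metal, *fit_tensor(atoms, pcs, metal)) for metal in grid),
               key=lambda item: item[2])

# Synthetic test case: 30 random nuclei and PCS values simulated from a known tensor.
rng = random.Random(0)
atoms = [tuple(rng.uniform(-8.0, 8.0) for _ in range(3)) for _ in range(30)]
true_metal = (12.0, 0.0, 0.0)
true_tensor = [1.0, -0.5, 0.2, 0.1, -0.3]
pcs = [sum(c * t for c, t in zip(design_row(a, true_metal), true_tensor))
       for a in atoms]

grid = [(0.0, 12.0, 0.0), true_metal, (0.0, 0.0, 12.0), (14.0, 2.0, 0.0)]
best_metal, best_tensor, best_residual = grid_search(atoms, pcs, grid)
print(best_metal, [round(t, 4) for t in best_tensor])
```

In a structure calculation the residual of the best grid point would serve as the PCS score for the model under evaluation, so that models whose coordinates cannot reproduce the measured shifts with any single tensor are penalized.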
As highlighted by these efforts, the combination of RDCs, PREs and PCSs with local structural restraints obtained from NMR chemical shifts provides a powerful approach towards modeling protein structures with high accuracy, using very sparse or, in some cases, no NOE data. This can be a valuable tool for the study of membrane proteins [ 235 , 236 ], and proteins in the solid-state [ 237 ].
All chemical shift-based structure determination methods discussed in this review are available publicly via web servers or downloadable software packages ( Table 2 ). Detailed manuals are available for most methods, rendering them easy to use by users with a minimal background in UNIX operating systems. Most de novo prediction methods employ fragment assembly to model monomeric or oligomeric protein structures, with various computational requirements owing to the complexity and parallelization of the corresponding protocols. While the fragment selection step itself has very modest computational cost (30–40 min on a commodity machine, depending on the target size), and can be run in parallel using the MPI build of Rosetta, ab initio structure refinement is a more demanding task.
Availability of a subset of the chemical shift-based methods used for structure determination discussed in this review.
Method | Availability (website) | Web server support | Platforms supported |
---|---|---|---|
CHESHIRE | Available from authors | | |
CS-Rosetta | | | UNIX-based systems |
RASREC-Rosetta | | | UNIX-based systems |
CS-MD | Available from authors | | |
CS-HM-Rosetta | | | UNIX-based systems |
Pomona | | | UNIX-based systems |
RASREC-Rosetta with EC restraints | | | UNIX-based systems |
EC-NMR | | | UNIX-based systems (and Windows for several steps of the protocol) |
RosettaOligomers | | | UNIX-based systems |
CamDock | Available from authors | | |
CS-Rosetta-RNA | | | UNIX-based systems |
AutoNOE-Rosetta | | | UNIX-based systems |
sPRE-CS-Rosetta | | | UNIX-based systems |
PCS-Rosetta | | | UNIX-based systems |
GPS-Rosetta | | | UNIX-based systems |
4D-CHAINS | | | UNIX-based systems and Windows |
As a representative example, we compared total runtimes as a function of the number of processors for the 20 kDa aLP target using three main approaches: CS-Rosetta, RASREC-Rosetta and AutoNOE-Rosetta. As input, these protocols were given the amino acid sequence along with three- and nine-residue fragments derived using NMR chemical shifts [ 216 ]. In addition, two unassigned NOESY (4D HCCH and 4D HCNH) peak lists were provided to AutoNOE-Rosetta to perform NOE assignment alongside ab initio structure determination. CS-Rosetta and RASREC-Rosetta calculations were performed independently using manually assigned NOE constraints. We sampled a total of 10,000 independent CS-Rosetta structures, whereas RASREC-Rosetta and AutoNOE-Rosetta generated 50–80 batches of 100 structures (depending on the progression of each protocol). The fragment-based approaches generally require 16 or more CPUs (of a commodity computer or a UNIX-based cluster) to yield structures in a reasonable amount of time ( Fig. 12 ). Due to sampling bottlenecks, it is recommended to run such calculations on 64 or more CPUs for larger (>200 aa) targets.
Performance of chemical shift-based fragment assembly methods. Performance of CS-Rosetta (green), RASREC-Rosetta (orange) and AutoNOE-Rosetta (blue) for a 20 kDa target, aLP, given by their runtimes measured as a function of the number of processors (or CPUs) used for structure calculation. All the runs were carried out using sequence information, chemical shifts assigned by 4D-CHAINS, NOE restraints (for CS- and RASREC-Rosetta) and unassigned peak lists (for AutoNOE-Rosetta) as input. The points on the plot represent independent structure calculations performed by the respective methods using various numbers of processors (16, 32, 64, 100 and 200). The y-axis shows the time (in hours) taken by the methods for each calculation, which is bounded by the number of hours considered reasonable (~250 h or 10 days). For CS-Rosetta calculations, 10,000 structures are produced during each run. Similarly, for RASREC-Rosetta and AutoNOE-Rosetta calculations, 50–80 batches of size 100 are produced for every execution.
Comparatively, homology-based approaches carry out a preprocessing step to derive restraints or to identify templates from a set of evolutionarily related proteins. Representative times needed to generate restraints/find templates for aLP using CS-HM-Rosetta, Pomona, and EVFold (for RASREC-Rosetta and EC-NMR) are approximately 4, 400 and 300 min, respectively, on a single CPU.
Ever since the first de novo atomic-resolution structure determined by NMR in 1985 [ 238 ], chemical shifts have remained an invaluable tool for spectroscopists towards examining the structure and dynamics of biomolecules for systems up to 1 MDa (Megadalton) in the solution- [ 174 ] and solid-state [ 239 ]. More recently, NMR methods have been applied to determine protein structures within living cells [ 240 ]. In this review, we have outlined a representative subset of several complementary approaches for chemical shift-driven structure determination. The active development of new algorithms and expansion of curated databases has the potential to further improve the robustness and accuracy of chemical shift-based methods, to complement and possibly replace classical methods of NMR structure calculation. While the sensitivity and versatility of chemical shifts for structure determination are highlighted by the sheer number and applicability of available approaches, information provided by chemical shifts alone is largely limited to a description of local geometry [ 4 ]. Thus, hybrid approaches that combine chemical shifts with additional short-range and long-range restraints, such as NOE [ 219 ], RDC [ 220 ], PRE [ 221 ], and PCS [ 221 ] measurements, are expected to further increase the scope, accuracy and resolution of NMR-derived structures. Integrated approaches that incorporate NMR chemical shifts with other types of experimental data, such as SAXS [ 151 ], Cryo-EM (Cryo-electron microscopy) [ 241 ], SANS (Small Angle Neutron Scattering) [ 242 ], and EPR (Electron paramagnetic resonance) [ 243 ], will provide additional avenues for structure determination of larger and more challenging systems.
At the same time, automated methods have streamlined the chemical shift assignment procedure, allowing for structure determination of small to moderate sized proteins (up to 200 residues, ~22 kDa) with minimal intervention by the user [ 7 , 217 ]. Progress in automated NMR structure determination will enable a more thorough description of the protein fold space, allowing for more accurate homology modeling and fragment generation. For systems of larger size and dynamic complexity, automated methods benefit from advances in selective isotope labeling schemes [ 244 ], the use of probes with favorable relaxation properties, such as methyl groups [ 245 ], and utilization of sparse restraints [ 94 , 104 , 128 ]. Here, highly parallel, iterative protocols, such as RASREC-Rosetta, can lead to a drastic improvement in sampling efficiency and accurately determine near-native structures from sparser datasets. Knowing that sequence covariance can provide sufficient long-range information to model the folds of medium-sized proteins, several research groups have moved on to incorporate evolutionary information, which drastically reduces the computational costs required by more data-oriented approaches. With the advent of sparse data recorded for larger systems, the mining of evolutionary information from genome sequencing and the fine-tuning of sidechain conformations according to chemical shift data, the next generation of methods will aim to deliver a more accurate view of biomolecular structures and their dynamics towards a new renaissance in structure determination by NMR methods.
The authors would like to thank Oliver Lange, Robert Vernon, Yang Shen, Jinfa Ying, Paolo Rossi, Flemming Hansen, Kostas Tripsianes, David Baker and Ad Bax for helpful discussions over the years. This manuscript was supported in part by funds from the Intramural research program of the NIAID, NIH, a K-22 Career Development and an R35 Outstanding Investigator Award to N.G.S. through NIAID (AI112573) and NIGMS (R35GM125034), respectively. Research reported in this publication was supported by the Office of the Director, NIH, under Award Number S10OD018455.
Å | Angstrom |
µs | microsecond |
aa | amino acid |
aLP | α-lytic protease |
AAIG | amino acid index group |
ARIA | ambiguous restraints for iterative assignment |
AutoNOE-Rosetta | automatic NOESY assignment-rosetta |
BLOSUM | block substitution matrix |
BMRB | biological magnetic resonance bank |
CASD-NMR | community wide assessment of NMR structure determination |
CASP | critical assessment of methods of protein structure prediction |
CEST | chemical-exchange saturation transfer |
CHEOPS | chemical shift de novo structure derivation protocol employing singular value decomposition |
CHESHIRE | chemical shift restraints |
CMV | cytomegalovirus |
COSY | correlation spectroscopy |
CPMG | carr-purcell-meiboom-gill |
Cryo-EM | cryo-electron microscopy |
CS-HM-Rosetta | chemical shift-homology modeling-rosetta |
CS-Rosetta-RNA | chemical shift-rosetta-ribonucleic acid |
CS-Rosetta | chemical shift-rosetta |
CS-RosettaCM | chemical shift-rosetta comparative modeling |
CSA | chemical shift anisotropy |
CS-MD | chemical shift restrained molecular dynamics |
CYANA | combined assignment and dynamics algorithm for nmr applications |
DFT | density functional theory |
DI | direct information |
DSSP | database of secondary structure assignments |
EC | evolutionary coupling |
EC-NMR | evolutionary coupling-nuclear magnetic resonance spectroscopy |
EPR | electron paramagnetic resonance |
EVFold | evolutionary fold |
FARFAR | fragment assembly of RNA with full atom refinement |
HADDOCK | high ambiguity driven docking |
HHSearch | HMM-HMM search |
HIV | human immunodeficiency virus |
HMM | hidden markov model |
HMQC | heteronuclear multiple quantum coherence |
HSQC | heteronuclear single quantum coherence |
I-TASSER | iterative threading assembly refinement |
Ig | immunoglobulin |
ILV | isoleucine leucine valine |
kDa | kilodalton |
MDa | megadalton |
MFR | molecular fragment replacement |
MPI | message passing interface |
ms | millisecond |
MSA | multiple sequence alignment |
NESG | northeast structural genomics |
NMR | nuclear magnetic resonance |
NOE | nuclear overhauser effect |
NOESY | nuclear overhauser effect spectroscopy |
NUS | non-uniform sampling |
PCS | pseudocontact shifts |
PDB | protein data bank |
PISCES | public server for culling sets of protein sequences from the PDB by sequence identity |
PLM | pseudo likelihood maximization |
Pomona | protein alignments obtained by matching of NMR assignments |
PRE | paramagnetic relaxation enhancement |
PROMEGA | proline omega angle prediction |
RASREC-Rosetta | resolution adapted structural recombination-rosetta |
RDC | residual dipolar coupling |
R.E.U. | rosetta energy units |
RM | regulatory module |
RMSD | root mean square deviation |
RNA | ribonucleic acid |
RosettaCM | rosetta comparative modeling |
RTT | regulator of Ty1 transposition |
SANS | small-angle neutron scattering |
SAXS | small-angle X-ray scattering |
SCOP | structural classification of proteins |
SH | src homology |
SPARTA | shifts predicted from analogy in residue type and torsion angle |
sPRE | solvent paramagnetic relaxation enhancement |
SrtA | sortase A |
STRIDE | structural identification |
SWA | stepwise assembly |
TALOS | torsion angle likelihood obtained from shift and sequence similarity |
TOCSY | total correlation spectroscopy |
Conflict of interest
The authors declare that they have no conflict of interest.