  • Open access
  • Published: 18 October 2022

Rapid protein assignments and structures from raw NMR spectra with the deep learning technique ARTINA

  • Piotr Klukowski   ORCID: orcid.org/0000-0003-1045-3487 1 ,
  • Roland Riek   ORCID: orcid.org/0000-0002-6333-066X 1 &
  • Peter Güntert   ORCID: orcid.org/0000-0002-2911-7574 1 , 2 , 3  

Nature Communications volume 13, Article number: 6151 (2022)

  • Machine learning
  • Solution-state NMR

Nuclear Magnetic Resonance (NMR) spectroscopy is a major technique in structural biology with over 11,800 protein structures deposited in the Protein Data Bank. NMR can elucidate structures and dynamics of small and medium size proteins in solution, living cells, and solids, but has been limited by the tedious data analysis process. It typically requires weeks or months of manual work of a trained expert to turn NMR measurements into a protein structure. Automation of this process is an open problem, formulated in the field over 30 years ago. We present a solution to this challenge that enables the completely automated analysis of protein NMR data within hours after completing the measurements. Using only NMR spectra and the protein sequence as input, our machine learning-based method, ARTINA, delivers signal positions, resonance assignments, and structures strictly without human intervention. Tested on a 100-protein benchmark comprising 1329 multidimensional NMR spectra, ARTINA demonstrated its ability to solve structures with 1.44 Å median RMSD to the PDB reference and to identify 91.36% correct NMR resonance assignments. ARTINA can be used by non-experts, reducing the effort for a protein assignment or structure determination by NMR essentially to the preparation of the sample and the spectra measurements.

Introduction

Studying structures of proteins and ligand-protein complexes is one of the most influential endeavors in molecular biology and rational drug design. All key structure determination techniques, X-ray crystallography, electron microscopy, and NMR spectroscopy, have led to remarkable discoveries, but suffer from their respective experimental limitations. NMR can elucidate structures and dynamics of small and medium size proteins in solution 1 and even in living cells 2. However, the analysis of NMR spectra and the resonance assignment, which are indispensable for NMR studies, remain time-consuming even for a skilled and experienced spectroscopist. As a consequence, the percentage of NMR protein structures in the Protein Data Bank (PDB) has decreased from a maximum of 14.6% in 2007 to 7.3% in 2021 (https://www.rcsb.org/stats). The problem has sparked research towards automating different tasks in NMR structure determination 3, 4, including peak picking 5, 6, 7, 8, 9, resonance assignment 10, 11, 12, and the identification of distance restraints 13, 14. Several of these methods are available as webservers 15, 16. This enabled semi-automatic 17, 18 but not yet unsupervised automation of the entire NMR structure determination process, except for a very small number of favorable proteins 7, 19.

The advance of machine learning techniques 20 now offers unprecedented possibilities for reliably replacing decisions of human experts by efficient computational tools. Here, we present a method that achieves this goal for NMR assignment and structure determination. We show for a diverse set of 100 proteins that NMR resonance assignments and protein structures can be determined within hours after completing the NMR measurements. Our method, Artificial Intelligence for NMR Applications, ARTINA (Fig. 1), combines machine learning for tasks that are difficult to model otherwise with existing algorithms: evolutionary optimization for resonance assignment with FLYA 12, chemical shift database searches for torsion angle restraint generation with TALOS-N 21, ambiguous distance restraints, network-anchoring and constraint combination for NOESY assignment 14, 22, and simulated annealing by torsion angle dynamics for structure calculation with CYANA 23. Machine learning is used in multiple flavors: deep residual neural networks 24 for visual spectrum analysis to identify peak positions (pp-ResNet) and to deconvolve overlapping signals (deconv-ResNet) in 25 different types of spectra (Supplementary Table 1), kernel density estimation (KDE) to reconstruct original peak positions in folded spectra, a deep graph neural network 25, 26 (GNN) for chemical shift estimation within the refinement of chemical shift assignments, and a gradient boosted trees 27 (GBT) model for the selection of structure proposals.

Fig. 1. The flowchart presents the interplay between the main components of the automated protein structure determination workflow: Residual Neural Network (ResNet), FLYA automated chemical shift assignment, Graph Neural Network (GNN), Gradient Boosted Trees (GBT), and CYANA structure calculation.

A major challenge in developing ARTINA was the collection and preparation of a large training data set that is required for machine learning, because, in contrast to assignments and structures, NMR spectra are generally not archived in public data repositories. Instead, we were obliged to collect from different sources and standardize complete sets of multidimensional NMR spectra for the assignment and structure determination of 100 proteins.

In this paper, we describe the algorithm, training and test data, and results of ARTINA automated structure determination, which are on par with those achieved in weeks or months of human experts’ labor.

Benchmark dataset

One of the major obstacles for developing deep learning solutions for protein NMR spectroscopy is the lack of a large-scale standardized benchmark dataset of protein NMR spectra. To date, published manuscripts presenting the most notable methods for computational NMR typically refer to fewer than 50 2D/3D/4D NMR spectra in their experimental sections. Even the well-recognized CASD-NMR competition cannot serve as a major source of training data for deep learning, since only the NOESY spectra of 10 proteins were used in the last round of the event 28.

To make our study possible, we established a standardized benchmark of 1329 2D/3D/4D NMR spectra, which allows 100 proteins to be recalculated using their original spectral data (Fig.  2 and Supplementary Table  2 ). Each protein record in our dataset contains 5–20 spectra together with manually identified chemical shifts (usually depositions at the Biological Magnetic Resonance Data Bank, BMRB) and the previously determined (“ground truth”) protein structure (PDB record; Supplementary Table  3 ). The benchmark covers protein sizes typically studied by NMR spectroscopy with sequence lengths between 35 and 175 residues (molecular mass 4–20 kDa).

Fig. 2. PDB codes (or names, MH04, MDM2, KRAS4B, if PDB code unavailable) of the 100 benchmark proteins are ordered by the number of residues. The histogram shows the number of spectra for backbone assignment, side-chain assignment, and NOE measurement. Spectrum types in each data set are shown by light to dark blue circles indicating the number of individual spectra of the given type. The percentages of benchmark records that contain a given spectrum type are given at the top. Spectrum types present in less than 5% of the data sets have been omitted.

Automated protein structure determination

The accuracy of protein structure determination with ARTINA was evaluated in a 5-fold cross-validation experiment with the aforementioned benchmark dataset. Five instances of pp-ResNet and GBT were trained, each one using data from about 80% of the proteins for training and the remaining ones for testing. Since each protein was present exactly once in the test set, reported quality metrics were obtained directly in the cross-validation experiment, and no averaging between data splits was required. To deploy pp-ResNet and GBT models in our online system, we constructed an ensemble by averaging predictions of all 5 cross-validation models. The other models were trained only once using either generated data (deconv-ResNet, Supplementary Fig.  1 ) or BMRB depositions excluding all benchmark proteins (GNN, KDE).
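The protein-level split and the ensemble averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the placeholder protein identifiers, and the seeded random shuffle are all assumptions, since the text does not say how proteins were distributed over the five folds.

```python
import random


def cross_validation_splits(protein_ids, n_folds=5, seed=0):
    """Yield (train, test) lists so that each protein is tested exactly once."""
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)  # fixed seed gives reproducible folds
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        # Training set: every protein that is not in the current test fold.
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test


def ensemble_average(per_model_predictions):
    """Average the predictions of several models, item by item."""
    return [sum(values) / len(values) for values in zip(*per_model_predictions)]


# Hypothetical identifiers standing in for the 100 benchmark proteins.
proteins = [f"protein_{i:03d}" for i in range(100)]
splits = list(cross_validation_splits(proteins))
```

With 100 proteins and five folds, each split trains on 80 proteins and tests on the remaining 20, which matches the "about 80%" stated in the text.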

In this experiment, we reproduced 100 structures in a fully automated manner using only NMR spectra and the protein sequences as input. Since ARTINA has no tunable parameters and does not require any manual curation of data, each structure was calculated by a single execution of the ARTINA workflow. All benchmark datasets were analyzed by ARTINA in parallel with execution times of 4–20 h per protein.

All automatically determined structures, overlaid with the corresponding reference structures from the PDB, are visualized in Fig. 3, Supplementary Fig. 2, and Supplementary Movie 1. ARTINA was able to reproduce the reference structures with a median backbone root-mean-square deviation (RMSD) of 1.44 Å between the mean coordinates of the ARTINA structure bundle and the mean coordinates of the corresponding reference PDB structure bundle for the backbone atoms N, Cα, C′ in the residue ranges determined by CYRANGE 29 (Fig. 4a and Supplementary Table 4). ARTINA automatically identified between 459 and 4678 distance restraints (2198 on average over 100 proteins), which corresponds to 4.25–33.20 restraints per residue (Fig. 4b). This number is mainly influenced by the extent of unstructured regions and the quality of the NOESY spectra. In agreement with earlier findings 30, it correlates only weakly with the backbone RMSD to reference (linear correlation coefficient −0.38). As a more expressive validation measure for the structures from ARTINA, we computed a predicted RMSD to the PDB reference structure on the basis of the RMSDs between the 10 candidate structure bundles calculated in ARTINA (see “Methods”, Fig. 5, and Supplementary Table 5). The average deviation between actual and predicted RMSDs for the 100 proteins in this study is 0.35 Å, and their linear correlation coefficient is 0.77 (Fig. 5). In no case does the true RMSD exceed the predicted one by more than 1 Å.
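The comparison between actual and predicted RMSD values can be illustrated with a short sketch. It assumes that actual RMSDs are capped at 4 Å before comparison (as done for the pRMSD plot) and that "average deviation" means the mean absolute difference; both are assumptions, and the numbers below are invented for illustration only.

```python
def compare_rmsd(actual, predicted, cap=4.0):
    """Return (mean absolute deviation, largest underestimate) in Å.

    Actual RMSDs are capped at `cap` so that they share the range of the
    predicted values. The largest underestimate is max(actual - predicted).
    """
    pairs = [(min(a, cap), p) for a, p in zip(actual, predicted)]
    mean_abs_dev = sum(abs(a - p) for a, p in pairs) / len(pairs)
    worst_underestimate = max(a - p for a, p in pairs)
    return mean_abs_dev, worst_underestimate


# Invented example values, not taken from the benchmark.
actual_rmsd = [1.2, 2.0, 4.5]
predicted_rmsd = [1.0, 2.5, 3.6]
mean_dev, worst = compare_rmsd(actual_rmsd, predicted_rmsd)
```

A `worst` value below 1 Å corresponds to the statement that the true RMSD never exceeds the predicted one by more than 1 Å.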

Fig. 3. The structures are aligned with the RMSD to reference range as indicated on the left and hexagonal frames color-coded by their size as indicated above. Structures with no corresponding PDB depositions are marked by an asterisk.

Fig. 4. a Backbone RMSD to reference. b Number of distance restraints per residue. c Chemical shift assignment accuracy. Bars represent quantity values for benchmark proteins, identified by PDB codes (or protein names). Proteins are ordered by size, which is indicated by a color-coded circle. Values in the center of each panel are 10th, 50th, and 90th percentiles of values presented in the bar plot. Short/medium/long-range restraints are between residues i and j with |i – j| ≤ 1, 2 ≤ |i – j| ≤ 4, and |i – j| ≥ 5, respectively.
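The residue-separation classes defined in this caption translate directly into code. The sketch below is illustrative only: the function names and the representation of a restraint as a pair of residue numbers are assumptions.

```python
def restraint_range(i, j):
    """Classify a distance restraint by the sequence separation |i - j|."""
    separation = abs(i - j)
    if separation <= 1:
        return "short"
    if separation <= 4:
        return "medium"
    return "long"


def count_by_range(restraints):
    """Count restraints, given as (i, j) residue-number pairs, per class."""
    counts = {"short": 0, "medium": 0, "long": 0}
    for i, j in restraints:
        counts[restraint_range(i, j)] += 1
    return counts


# Invented residue pairs for illustration.
example_counts = count_by_range([(10, 10), (10, 11), (10, 13), (10, 40)])
```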

Fig. 5. The predicted RMSD to reference (pRMSD) is calculated from the ARTINA results without knowledge of the reference PDB structure (see “Methods”) and, by definition, always in the range of 0–4 Å. For comparability, actual RMSD values to reference are also truncated at 4 Å (protein 2M47 with RMSD 4.47 Å). The dotted lines represent deviations of ±1 Å between the two RMSD quantities.

Additional structure validation scores obtained from ANSURR 31 (Supplementary Table 6), RPF 32 (Supplementary Table 7), and consensus structure bundles 33 (Supplementary Table 8) confirm that overall the ARTINA structures and the corresponding reference PDB structures are of equivalent quality. Energy refinement of the ARTINA structures in explicit water using OPALp 34 (not part of the standard ARTINA workflow) does not significantly alter the agreement with the PDB reference structures (Supplementary Table 9). The benchmark data set comprises 78 protein structures determined by the Northeast Structural Genomics Consortium (NESG). On average, ARTINA yielded structures of the same accuracy for NESG targets (median RMSD to reference 1.44 Å) as for proteins from other sources (1.42 Å).

On average, ARTINA correctly assigned 90.39% of the chemical shifts (Fig.  4c ), as compared to the manually prepared assignments, including both “strong” (high-reliability) and “weak” (tentative) FLYA assignments 12 . Backbone chemical shifts were assigned more accurately (96.03%) than side-chain ones (86.50%), which is mainly due to difficulties in assigning lysine/arginine (79.97%) and aromatic (76.87%) side-chains. Further details on the assignment accuracy for individual amino acid types in the protein cores (residues with less than 20% solvent accessibility) are given in Supplementary Table  10 . Assignments for core residues, which are important for the protein structure, are generally more accurate than for the entire protein, in particular for core Ala, Cys, and Asp residues, which show a median assignment accuracy of 100% over the 100 proteins. The lowest accuracies are observed for core His (83.3%), Phe (83.3%), and Arg (87.5%) residues. The three proteins with highest RMSD to reference, 2KCD, 2L82, and 2M47 (see below), show 68.2, 83.8, and 75.7% correct aromatic assignments, respectively, well below the corresponding median of 85.5%. On the other hand, the assignment accuracies for the methyl-containing residues Ala, Ile, Val are above average and reach a median of 100, 97.6, and 98.6%, respectively.

The quality of automated structure determination and chemical shift assignment reflects the performance of deep learning-based visual spectrum analysis, presented qualitatively in Figs.  6 – 7 , Supplementary Fig.  3 , and Supplementary Movies  2 – 4 . In this experiment, our models (pp-ResNet, deconv-ResNet) automatically identified 1,168,739 cross-peaks with high confidence (≥0.50) in the benchmark spectra. All 1329 peak lists, together with automatically determined protein structures and chemical shift lists, are available for download.

Fig. 6. A fragment of a 15N-HSQC spectrum of the protein 1T0Y is shown. Initial signal positions identified by the peak picking model pp-ResNet (black dots) are deconvolved by deconv-ResNet, yielding the final coordinates used for automated assignment and structure determination (blue crosses). a1, a2 Initial peak picking marker position is refined by the deconvolution model. b1, b2 pp-ResNet output is deconvolved into two components. c The deconvolution model supports a maximum of 3 components per initial signal. d Two peak picking markers are merged by the deconvolution model. e Peak picking output deconvolved into three components.

Fig. 7. A fragment of the 13C-HSQC spectrum of protein 2K0M is shown. Initial signal positions identified by the peak picking model pp-ResNet (black dots) are deconvolved by deconv-ResNet, yielding the final coordinates used for automated assignment and structure determination (blue crosses).

Error analysis

The largest deviations from the PDB reference structure were observed for the proteins 2KCD, 2L82, and 2M47, for which the pRMSD consistently indicated low accuracy (Fig.  5 ). Significant deviations are mainly due to displacements of terminal secondary structure elements (e.g., a tilted α-helix near a chain terminus), or inaccurate loop conformations (e.g., more flexible than in the PDB deposition). We investigated the origin of these discrepancies.

2KCD is a 120-residue (14.4 kDa) protein from Staphylococcus saprophyticus with an α-β roll architecture. Its dataset comprises 19 spectra (8 backbone, 6 side-chain, and 5 NOESY). The ARTINA structure has a backbone RMSD to PDB reference of 3.13 Å, which is caused by the displacement of the C-terminal α-helix (residues 105–109; Supplementary Fig.  4a ). Excluding this 5-residue fragment decreases the RMSD to 2.40 Å (Supplementary Table  11 ). The positioning of this helix appears to be uncertain, since an ARTINA calculation without the 4D CC-NOESY spectrum yields a significantly lower RMSD of 1.77 Å (Supplementary Table  12 ).

2L82 is a de novo designed protein of 162 residues (19.7 kDa) with an αβ 3-layer (αβα) sandwich architecture. Although only 9 spectra (4 backbone, 2 side-chain and 3 NOESY) are available, ARTINA correctly assigned 97.87% backbone and 81.05% side-chain chemical shifts. The primary reason for the high RMSD value of 3.55 Å is again a displacement of the C-terminal α-helix (residues 138–153). The remainder of the protein matches closely the PDB deposition (1.04 Å RMSD, Supplementary Fig.  4b ).

The protein with the highest RMSD to reference (4.72 Å) in our benchmark dataset is 2M47, a 163-residue (18.8 kDa) protein from Corynebacterium glutamicum with an α-β 2-layer sandwich architecture, for which 17 spectra (7 backbone, 7 side-chain, and 3 NOESY) are available. The main source of discrepancy is a pair of α-helices spanning residues 111–157 near the C-terminus. Nevertheless, the residues contributing to the high RMSD value are distributed more extensively than in 2L82 and 2KCD just discussed. Interestingly, 2 of the 10 structure proposals calculated by ARTINA have an RMSD to reference below 2 Å (1.66 Å and 1.97 Å). In the final structure selection step, our GBT model selected the 4.72 Å RMSD structure as the first choice and 1.66 Å as the second one (Supplementary Fig. 4c). Such results imply that the automated structure determination of this protein is unstable. Since ARTINA returns the two structures selected by GBT with the highest confidence, the user can, in principle, choose the better structure based on contextual information.

In addition to these three case studies, we performed a quantitative analysis of all regular secondary structure elements and flexible loops present in our 100-protein benchmark in order to assess their impact on the backbone RMSD to reference (Supplementary Table  11 ). All residues in the structurally well-defined regions determined by CYRANGE 29 were assigned to 6 partially overlapping sets: (a) first secondary structure element, (b) last secondary structure element, (c) α-helices, (d) β-sheets, (e) α-helices and β-sheets, and (f) loops. Then, the RMSD to reference was calculated 6 times, each time with one set excluded. In total, for 66 of the 100 proteins the lowest RMSD was obtained if set (f) was excluded from RMSD calculation, and 13% benefited most from removal of the first or last secondary structure element (a or b). Moreover, for 18 out of the 19 proteins with more than 0.5 Å RMSD decrease compared to the RMSD for all well-defined residues, (a), (b), or (f) was the primary source of discrepancy. These results are consistent with our earlier statement that deviations in automatically determined protein structures are mainly caused by terminal secondary structure elements or inaccurate loop conformations.
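The leave-one-set-out analysis described above can be sketched in simplified form. The sketch assumes that per-residue squared deviations from a single, fixed superposition are available and that excluding a set simply drops its residues from the average; the original analysis may re-superimpose the structures after each exclusion, which this illustration does not do. All names and numbers are hypothetical.

```python
import math


def rmsd_over(sq_dev, residues):
    """RMSD in Å over the given residues from per-residue squared deviations."""
    values = [sq_dev[r] for r in residues]
    return math.sqrt(sum(values) / len(values))


def best_exclusion(sq_dev, residue_sets):
    """Recompute the RMSD with each set excluded and report the lowest one."""
    all_residues = set(sq_dev)
    results = {}
    for name, members in residue_sets.items():
        kept = all_residues - set(members)
        if kept:  # skip sets that would leave no residues at all
            results[name] = rmsd_over(sq_dev, kept)
    best = min(results, key=results.get)
    return best, results


# Invented toy protein: 10 residues, two poorly matching loop residues.
sq_dev = {r: 1.0 for r in range(1, 11)}
sq_dev[5] = sq_dev[6] = 9.0
sets = {"first_sse": [1, 2, 3], "last_sse": [8, 9, 10], "loops": [5, 6]}
best_set, rmsd_by_set = best_exclusion(sq_dev, sets)
```

In this toy case, excluding the loop residues gives the lowest RMSD, mirroring the situation reported for most of the benchmark proteins.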

Ablation studies

During the experiment, we captured the state of each structure determination at 9 time-points, 3 per structure determination cycle: (a) after the initial FLYA shift assignment, (b) after GNN shift refinement, and (c) after structure calculation (Fig.  1 ). Comparative analysis of these states allowed us to quantify the contribution of different ARTINA components to the structure determination process (Table  1 ).

The results show a strong benefit of the refinement cycles, as quantities reported in Table  1 consistently improve from cycle 1 to 3. The majority of benchmark proteins converge to the correct fold after the first cycle (1.56 Å median backbone RMSD to reference), which is further refined to 1.52 Å in cycle 2 and 1.44 Å in cycle 3. Additionally, within each chemical shift refinement cycle, improvements in assignment accuracy resulting from the GNN predictions are observed. This quantity also increases consistently across all refinement cycles, in particular for side-chains. Refinement cycles are particularly advantageous for large and challenging systems, such as 2LF2, 2M7U, or 2B3W, which benefit substantially in cycles 2 and 3 from the presence of the approximate protein fold in the chemical shift assignment step.

Impact of 4D NOESY experiments

As presented in Fig.  2 , 26 out of 100 benchmark datasets contain 4D CC-NOESY spectra, which require long measurement times and were used in the manual structure determination. To quantify their impact, we performed automated structure determinations of these 26 proteins with and without the 4D CC-NOESY spectra (Supplementary Table  12 ).

On average, the presence of 4D CC-NOESY improves the backbone RMSD to reference by 0.15 Å (decrease from 1.88 to 1.73 Å) and has less than 1% impact on chemical shift assignment accuracy. However, the impact is non-uniform. For three proteins, 2KIW, 2L8V, and 2LF2, use of the 4D CC-NOESY decreased the RMSD by more than 1 Å. On the other hand, there is also one protein, 2KCD, for which the RMSD decreased by more than 1 Å by excluding the 4D CC-NOESY.

These results suggest that overall the amount of information stored in 2D/3D experiments is sufficient for ARTINA to reach close to optimal performance, and only modest improvement can be achieved by introducing additional information redundancy from 4D CC-NOESY spectra.

Automated chemical shift assignment

Apart from structure determination, our data analysis pipeline for protein NMR spectroscopy can address an array of problems that are nowadays approached manually or semi-manually. For instance, ARTINA can be stopped after visual spectrum analysis, returning positions and intensities of cross-peaks that can be utilized for any downstream task, not necessarily related to protein structure determination.

Alternatively, a single chemical shift refinement cycle can be performed to get automatically assigned cross-peaks from spectra and sequence. We evaluated this approach with three sets of spectra: (i) Exclusively backbone assignment spectra were used to assign N, Cα, Cβ, C′, and HN shifts. With this input, ARTINA assigned 92.40% (median value) of the backbone shifts correctly. (ii) All through-bond but no NOESY spectra were used to assign the backbone and side-chain shifts. This raised the percentage of correct backbone assignments to 94.20%. (iii) The full data set including NOESY yielded 96.60% correct assignments of the backbone shifts. These three experiments were performed for the 45 benchmark proteins, for which CBCANH and CBCAcoNH, as well as either HNCA and HNcoCA or HNCO and HNcaCO experiments were available. The availability of NOESY spectra had a large impact on the side-chain assignments: 86.00% were correct for the full spectra set iii, compared to 73.70% in the absence of NOESY spectra (spectra set ii). The presence of NOESY spectra consistently improved the chemical shift assignment accuracy of all amino acid types (Supplementary Tables 13 and 14). The improvement is particularly strong for aromatic residues (Phe 61.6 to 76.5%, Trp 52.5 to 80%, and Tyr 71.4 to 89.7%), but not limited to this group.

The results obtained with ARTINA differ in several aspects substantially from previous approaches towards automating protein NMR analysis 3, 4, 7, 12, 17, 18, 19, 35. First, ARTINA covers the entire workflow from spectra to structures rather than individual steps in it, and there are strictly no manual interventions or protein-specific parameters to be adapted. Second, the quality of the results regarding peak identification, resonance assignments, and structures has been assessed on a large and diverse set of 100 proteins; for the vast majority of these, the results are on par with what can be achieved by human experts. Third, the method provides a two-orders-of-magnitude leap in efficiency by providing assignments and a structure within hours of computation time rather than weeks or months of human work. This reduces the effort for a protein structure determination by NMR essentially to the preparation of the sample and the measurement of the spectra. Its implementation in the https://nmrtist.org webserver (Supplementary Movie 5) encapsulates its complexity, eliminates any intermediate data and format conversions by the user, and enables the use of different types of high-performance hardware as appropriate for each of the subtasks. ARTINA is not limited to structure determination but can be used equally well for peak picking and resonance assignment in NMR studies that do not aim at a structure, such as investigations of ligand binding or dynamics.

Although ARTINA has no parameters to be optimized by the user, care should be given to the preparation of the input data, i.e., the choice, measurement, processing, and specification of the spectra. Spectrum type, axes, and isotope labeling declarations must be correct, and chemical shift referencing consistent over the entire set of spectra. Slight variations of corresponding chemical shifts within the tolerances of 0.03 ppm for 1H and 0.4 ppm for 13C/15N can be accommodated, but larger deviations, resulting, for instance, from the use of multiple samples, pH changes, protein degradation, or inaccurate referencing, can be detrimental. Where appropriate, ARTINA proposes corrections of chemical shift referencing 36. Furthermore, based on the large training data set, which comprises a large variety of spectral artifacts, ARTINA largely avoids misinterpreting artifacts as signals. However, with decreasing spectral quality, ARTINA, like a human expert, will progressively miss real signals.
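A user preparing input data could screen for inconsistent referencing with a check along the following lines. This is a sketch under assumptions: it interprets the stated tolerances as the maximum allowed spread of one nucleus's shift across spectra, and the function and dictionary names are hypothetical rather than part of ARTINA.

```python
# Tolerances in ppm as stated in the text (0.03 for 1H, 0.4 for 13C/15N).
TOLERANCE_PPM = {"1H": 0.03, "13C": 0.4, "15N": 0.4}


def shifts_consistent(nucleus, observed_shifts):
    """True if the spread of one atom's shift across spectra is tolerated.

    `observed_shifts` holds the positions (ppm) at which the same atom
    appears in different spectra. Spread is taken as max minus min, which
    is an assumed reading of "within the tolerances".
    """
    spread = max(observed_shifts) - min(observed_shifts)
    return spread <= TOLERANCE_PPM[nucleus]


# Invented amide proton positions from three spectra.
amide_ok = shifts_consistent("1H", [8.21, 8.22, 8.23])
amide_off = shifts_consistent("1H", [8.21, 8.30])
```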

Regarding protein size and spectrum quality, limitations of ARTINA are similar to those encountered by a trained spectroscopist. Machine-learning-based visual analysis of spectra requires signals to be present and distinguishable in the spectra. ARTINA does not suffer from accidental oversight that may affect human spectra analysis. On the other hand, human experts may exploit contextual information to which the automated system currently has no access because it identifies individual signals by looking at relatively small, local excerpts of spectra.

In this paper, we used all spectra that are available from the earlier manual structure determination. For most of the 100 proteins, the spectra data set has significant redundancy regarding information for the resonance assignment. Our results indicate that one can expect to obtain good assignments and structures also from smaller sets of spectra 37 , with concomitant savings of NMR measurement time. We plan to investigate this in a future study.

The present version of ARTINA can be enhanced in several directions. Besides improving individual models and algorithms, it is conceivable to integrate the so far independently trained collection of machine learning models, plus additional models that replace conventional algorithms, into a coherent system that is trained as a whole. Furthermore, the reliability of machine learning approaches depends strongly on the quantity and quality of training data available. While the collection of the present training data set for ARTINA was cumbersome, from now on it can be expected to expand continuously through the use of the https://nmrtist.org website, both quantitatively and qualitatively with regard to greater variability in terms of protein types, spectral quality, source laboratory, data processing (including non-linear sampling), etc., which can be exploited in retraining the models. ARTINA can also be extended to use additional experimental input data, e.g., known partial assignments, stereospecific assignments, 3J couplings, residual dipolar couplings, paramagnetic data, and H-bonds. Structural information, e.g., from AlphaFold 38, can be used in combination with reduced sets of NMR spectra for rapid structure-based assignment. Finally, the range of application of ARTINA can be generalized to small molecule-protein complexes relevant for structure-activity relationship studies in drug research, protein-protein complexes, RNA, solid state, and in-cell NMR.

Overall, ARTINA stands for a paradigm change in biomolecular NMR from a time-consuming technique for specialists to a fast method open to researchers in molecular biology and medicinal chemistry. At the same time, in a larger perspective, the appearance of generally highly accurate structure predictions by AlphaFold 38 is revolutionizing structural biology. Nevertheless, there remains space for the experimental methods, for instance, to elucidate various states of proteins under different conditions or in dynamic exchange, or for studying protein-ligand interaction. Regarding ARTINA, one should keep in mind that its applications extend far beyond structure determination. It will accelerate virtually any biological NMR studies that require the analysis of multidimensional NMR spectra and chemical shift assignments. Protein structure determination is just one possible ARTINA application, which is both demanding in terms of the amount and quality of required experimental data and amenable to quantitative evaluation.

Spectrum benchmark collection

To collect the benchmark of NMR spectra (Fig. 2 and Supplementary Table 2), we implemented crawler software that systematically scanned the FTP server of the BMRB data bank 39, identifying data files relevant to our study. Additional datasets were obtained by setting up a website for the deposition of published data (https://nmrdb.ethz.ch), from our collaboration network, or had been acquired internally in our laboratory. NMR data was collected from these channels either in the form of processed spectra (Sparky 40, NMRPipe 41, XEASY 42, Bruker formats), or in the form of time-domain data accompanied by depositor-supplied NMRPipe processing scripts. No additional spectra processing (e.g., baseline correction) was performed as part of this study.

The most challenging aspects of the benchmark collection process were: (i) scarcity of data, as only a small fraction of all BMRB depositions are accompanied by uploaded spectra (or time-domain data); (ii) lack of standards for NMR data depositions, so that each protein data set had to be prepared manually because the original data was stored in different formats (spectra name conventions, axis label standards, spectra data format); and (iii) difficulties in correlating data files deposited in the BMRB FTP site with contextual information about the spectrum and the sample (e.g., sample characteristics, measurement conditions, instrument used). Manually prepared (mostly NOESY) peak lists, which are available from the BMRB for some of the proteins in the benchmark, were not used for this study.

Different approaches to 3D 13C-NOESY spectra measurement had to be taken into account: (i) Two separate 13C-NOESY spectra for aliphatic and aromatic signals. These were analyzed by ARTINA without any special treatment. We used ALI, ARO tags (Supplementary Movie 5) to provide the information that only either aliphatic or aromatic shifts are expected in a given spectrum. (ii) Simultaneous NC-NOESY. These spectra were processed twice to have proper scaling of the 13C and 15N axes in ppm units, and cropped to extract 15N-NOESY and 13C-NOESY spectra. If nitrogen and carbon cross-peak amplitudes have different signs, we used POS, NEG tags to provide the information that only either positive or negative signals should be analyzed. (iii) Aliphatic and aromatic signals in a single 13C-NOESY spectrum. These measurements do not require any special treatment, but proper cross-peak unfolding plays a vital role in the analysis of aromatic signals.
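The effect of the POS and NEG tags can be pictured as a sign filter on the picked peaks. The sketch below is purely illustrative: the peak representation as a dictionary with an `amplitude` key and the function name are assumptions, and the way ARTINA applies these tags internally is not described in the text.

```python
def filter_by_sign(peaks, tag):
    """Keep only peaks whose amplitude sign matches a POS or NEG tag."""
    if tag == "POS":
        return [p for p in peaks if p["amplitude"] > 0]
    if tag == "NEG":
        return [p for p in peaks if p["amplitude"] < 0]
    raise ValueError(f"unknown tag: {tag!r}")


# Invented peaks from a simultaneous NC-NOESY with opposite-sign signals.
picked = [
    {"position": (8.2, 120.1, 4.3), "amplitude": 1.5e6},
    {"position": (1.1, 21.7, 4.0), "amplitude": -9.0e5},
    {"position": (7.9, 118.4, 2.1), "amplitude": 3.2e5},
]
positive_peaks = filter_by_sign(picked, "POS")
negative_peaks = filter_by_sign(picked, "NEG")
```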

Overview of the ARTINA algorithm

ARTINA uses as input only the protein sequence and a set of NMR spectra, which may contain any combination of 25 experiments currently supported by the method (Supplementary Table  1 ). Within 4–20 h of computation time (depending on protein size, number of spectra, and computing hardware load), ARTINA determines: (a) cross-peak positions for each spectrum, (b) chemical shift assignments, (c) distance restraints from NOESY spectra, and (d) the protein structure. The whole process does not require any human involvement, allowing rapid protein NMR assignment and structure determination by non-experts.

The ARTINA workflow starts with visual spectrum analysis (Fig.  1 ), wherein cross-peak positions are identified in frequency-domain NMR spectra using deep residual neural networks (ResNet) 24 . Coordinates of signals in the spectra are passed as input to the FLYA automated assignment algorithm 12 , yielding initial chemical shift assignments . In the subsequent chemical shift refinement step, we bring to the workflow contextual information about thousands of protein structures solved by NMR in the past using a deep GNN 25 that was trained on BMRB/PDB depositions. Its goal is to predict expected values of yet missing chemical shifts, given the shifts that have already been confidently and unambiguously assigned by FLYA. With these GNN predictions as additional input, the cross-peak positions are reassessed in a second FLYA call, which completes the chemical shift refinement cycle (Fig.  1 ).

In the structure refinement cycle , 10 variants of NOESY peak lists are generated, which differ in the number of cross-peaks selected from the output of the visual spectrum analysis by varying the confidence threshold of a signal selected by ResNet between 0.05 and 0.5. Each set of NOESY peak lists is used in an independent CYANA structure calculation 22 , 23 , yielding 10 intermediate structure proposals (Fig.  1 ). The structure proposals are ranked in the intermediate structure selection step based on 96 features with a dedicated GBT model. The selected best structure proposal is used as contextual information in a consecutive FLYA run, which closes the structure refinement cycle .

After the two initial steps of visual spectrum analysis and initial chemical shift assignment, ARTINA executes the two refinement cycles alternately. The chemical shift refinement cycle provides FLYA with tighter restraints on expected chemical shifts, which helps to assign ambiguous cross-peaks. The structure refinement cycle provides information about possible through-space contacts, allowing identified cross-peaks (especially in NOESY) to be reassigned. The high-level concept behind alternating the refinement cycles is to iteratively update the protein structure given fixed chemical shifts, and to update the chemical shifts given the fixed protein structure. Both refinement cycles are executed three times.

Automated visual analysis of the spectrum

We established two machine learning models for the visual analysis of multidimensional NMR spectra (see downloads in the Code availability section). In their design, we made no assumptions about the downstream task and the 2D/3D/4D experiment type. Therefore, the proposed models can be used as the starting point of our automated structure determination procedure, as well as for any other task that requires cross-peak coordinates.

The automated visual analysis starts by selecting all extrema \({{{{{\boldsymbol{x}}}}}}=\left\{{{{{{{\boldsymbol{x}}}}}}}_{1},{{{{{{\boldsymbol{x}}}}}}}_{2},\ldots,{{{{{{\boldsymbol{x}}}}}}}_{N}\right\}\) , \({{{{{{\boldsymbol{x}}}}}}}_{n}\in {{\mathbb{N}}}^{D}\) in the NMR spectrum, which is represented as a D -dimensional regular grid storing signal intensities at discrete frequencies. We formulated the peak picking task as an object detection problem, where possible object positions are confined to \({{{{{\boldsymbol{x}}}}}}\) . This task was addressed by training a deep residual neural network 24 , in the following denoted as peak picking ResNet (pp-ResNet), which learns a mapping \({{{{{{\boldsymbol{x}}}}}}}_{n}\to[0,\,1]\) that assigns to each signal extremum a real-valued score, which resembles its probability of being a true signal rather than an artefact.
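The candidate selection described above can be illustrated with a minimal Python sketch. It assumes a 2D spectrum stored as a nested list and treats a grid point as an extremum when it is strictly above or strictly below all of its neighbours; the function name `local_extrema` and the neighbourhood definition are illustrative assumptions, not the ARTINA implementation.

```python
from itertools import product


def local_extrema(grid):
    """Return (row, col) indices of strict local maxima and minima of a 2D grid.

    Assumption for this sketch: a point is an extremum when its intensity is
    strictly larger (or strictly smaller) than all existing 8-connected
    neighbours. Every such point is only a candidate; a classifier such as
    pp-ResNet would later score it as signal or artefact.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    extrema = []
    for r, c in product(range(n_rows), range(n_cols)):
        neighbours = [
            grid[r + dr][c + dc]
            for dr, dc in product((-1, 0, 1), repeat=2)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < n_rows
            and 0 <= c + dc < n_cols
        ]
        value = grid[r][c]
        if all(value > v for v in neighbours) or all(value < v for v in neighbours):
            extrema.append((r, c))
    return extrema


# Toy spectrum with one positive peak at (1, 1) and one negative peak at (2, 3).
spectrum = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 5.0, 0.2, 0.0],
    [0.0, 0.2, 0.1, -3.0],
]
candidates = local_extrema(spectrum)
```

Note that low-intensity noise extrema are also returned as candidates, which is why a subsequent scoring step is needed.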

Our network architecture is closely related to ResNet-18 24. It contains 8 residual blocks, followed by a single fully connected layer with sigmoidal activation. After weight initialization with Glorot Uniform 43, the architecture was trained by optimizing a binary cross-entropy loss using Adam 44 with a learning rate of \(10^{-4}\) and gradient clipping of 0.5.

To establish an experimental training dataset for pp-ResNet, we normalized the 1329 spectra in our benchmark with respect to resolution (adjusting the number of data grid points per unit chemical shift (ppm) using linear interpolation) and signal amplitude (scaling the spectrum by a constant). Subsequently, 675,423 diverse 2D fragments of size 256 × 32 × 1 were extracted from the normalized spectra and manually annotated, yielding 98,730 positive and 576,693 negative class training examples. During the training process, we additionally augmented this dataset by flipping spectrum fragments along the second dimension (32 pixels), stretching them by 0–30% in the first and second dimensions, and perturbing signal intensities with Gaussian noise addition.

The role of the pp-ResNet is to quickly iterate over signal extrema in the spectrum, filtering out artefacts and selecting approximate cross-peak positions for the downstream task. The relatively small network architecture (8 residual blocks) and input size of 2D 256 × 32 image patches make it possible to analyze large 3D 13C-resolved NOESY spectra in less than 5 min on a high-end desktop computer. At the same time, the first dimension of the image patch (256 pixels) provides long-range contextual information on the possible presence of signals aligned with the current extremum (e.g., Cα and Cβ cross-peaks in an HNCACB spectrum).

Extrema classified with high confidence as true signals by pp-ResNet undergo subsequent analysis with a second deep residual neural network (deconv-ResNet). Its objective is to perform signal deconvolution, based on a 3D spectrum fragment (64 × 32 × 5 voxels) that is cropped around a signal extremum selected by pp-ResNet. This task is defined as a regression problem, where deconv-ResNet outputs a 3 × 3 matrix storing 3D coordinates of up to 3 deconvolved peak components, relative to the center of the input image. To ensure permutation invariance with respect to the ordering of components in the output coordinate matrix, and to allow for a variable number of 1–3 peak components, the architecture was trained with a Chamfer distance loss 45 .
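To illustrate why a Chamfer-type loss is insensitive to the ordering of peak components, the following Python sketch computes one common variant of the symmetric Chamfer distance (mean squared nearest-neighbour distance in both directions). The exact variant used to train deconv-ResNet is not specified here, so the formula and the function name `chamfer_distance` are assumptions made for illustration.

```python
import math


def chamfer_distance(predicted, target):
    """Symmetric Chamfer distance between two non-empty sets of 3D points.

    Assumed variant: for each point, take the squared Euclidean distance to
    its nearest neighbour in the other set, average per set, and sum both
    directions. The result does not depend on the order of the points.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return mean_nearest(predicted, target) + mean_nearest(target, predicted)


# The same two components listed in a different order give zero loss.
components_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
components_b = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
loss_permuted = chamfer_distance(components_a, components_b)

# A single component displaced by 2 units along the third axis.
loss_shifted = chamfer_distance([(0.0, 0.0, 0.0)], [(0.0, 0.0, 2.0)])
```

Because each point is matched to its nearest counterpart, the two sets may also contain different numbers of points, which is compatible with a variable number of 1–3 components.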

Since deconv-ResNet deals only with true signals and their local neighborhood, its training dataset can be conveniently generated. We established a spectrum fragment generator, based on rules reflecting the physics of NMR, which produced 110,000 synthetic training examples (Supplementary Fig.  1 ) having variable (a) numbers of components to deconvolve (1–3), (b) signal-to-noise ratio, (c) component shapes (Gaussian, Lorentzian, and mixed), (d) component amplitude ratios, (e) component separation, and (f) component neighborhood type (i.e., NOESY-like signal strips or HSQC-like 2D signal clusters). The deconv-ResNet model was thus trained on fully synthetic data.

Signal unaliasing

To use ResNet predictions in automated chemical shift assignment and structure calculation, detected cross-peak coordinates must be transformed from the spectrum coordinate system to their true resonance frequencies. We addressed the problem of automated signal unfolding with the classical machine learning approach to density estimation.

First, we generated \(10^{5}\) cross-peaks for each experiment type supported by ARTINA (Supplementary Table 1). In this process, we used randomly selected chemical shift lists deposited in the BMRB database, excluding depositions associated with our benchmark proteins. Subsequently, we trained a Kernel Density Estimator (KDE):

\[{p}_{e}\left(\boldsymbol{x}\right)=\frac{1}{{N}_{e}}\sum_{i=1}^{{N}_{e}}\kappa \left(\boldsymbol{x}-{\boldsymbol{x}}_{i}^{(e)}\right)\]

which captures the distribution \({p}_{e}\left(\boldsymbol{x}\right)\) of true peaks being present at position \(\boldsymbol{x}\) in spectrum type \(e\), based on \({N}_{e}={10}^{5}\) cross-peak coordinates \({\boldsymbol{x}}_{i}^{(e)}\) generated with BMRB data, with \(\kappa\) being the Gaussian kernel.

Unfolding a \(k\)-dimensional spectrum is defined as a discrete optimization problem, solved independently for each cross-peak \({\boldsymbol{x}}_{j}^{(e)}\) observed in a spectrum of type \(e\):

\[{\boldsymbol{s}}^{*}=\mathop{\mathrm{arg\,max}}\limits_{\boldsymbol{s}\in {\mathbb{Z}}^{k}}\;{p}_{e}\left({\boldsymbol{x}}_{j}^{(e)}+\boldsymbol{s}\circ \boldsymbol{w}\right)\]

where \(\boldsymbol{w}\in {\mathbb{R}}^{k}\) is a vector storing the spectral widths in each dimension (ppm units), \(\circ\) is element-wise multiplication, \(\boldsymbol{s}\in {\mathbb{Z}}^{k}\) is a vector indicating how many times the cross-peak is unfolded in each dimension, and \({\boldsymbol{s}}^{*}\in {\mathbb{Z}}^{k}\) is the optimal cross-peak unfolding.

As long as regular and folded signals do not overlap or have different signs in the spectrum, KDE can unfold the peak list regardless of spectrum dimensionality. The spectrum must not be cropped in the folded dimension, i.e., the folding sweep width must equal the width of the spectrum in the corresponding dimension.
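The KDE-based unfolding described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: an isotropic Gaussian kernel with a single bandwidth, a small brute-force search range `max_folds`, and three hand-written reference peaks instead of the \(10^{5}\) BMRB-derived cross-peaks. The names `kde_density` and `unfold_peak` are hypothetical.

```python
import math
from itertools import product


def kde_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at point x (isotropic kernel, assumed)."""
    k = len(x)
    norm = (2.0 * math.pi) ** (k / 2) * bandwidth ** k
    total = sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * bandwidth ** 2))
        for s in samples
    )
    return total / (len(samples) * norm)


def unfold_peak(peak, widths, samples, bandwidth=0.5, max_folds=2):
    """Return (s_star, unfolded_position) maximising the density of peak + s * w.

    The search range max_folds is an assumption of this sketch; every integer
    vector s with entries in [-max_folds, max_folds] is tried.
    """
    best = None
    for s in product(range(-max_folds, max_folds + 1), repeat=len(peak)):
        candidate = tuple(p + si * w for p, si, w in zip(peak, s, widths))
        density = kde_density(candidate, samples, bandwidth)
        if best is None or density > best[0]:
            best = (density, s, candidate)
    return best[1], best[2]


# Hypothetical reference peaks (1H, 13C) in the aromatic region.
reference_peaks = [(7.0, 130.0), (7.2, 128.0), (6.9, 131.0)]

# A peak observed at 50 ppm in a 13C dimension with an 80 ppm spectral width.
s_star, position = unfold_peak((7.1, 50.0), (12.0, 80.0), reference_peaks)
```

In this toy case the density is highest when the peak is shifted once by the spectral width in the second dimension, so the peak is placed among the reference peaks near 130 ppm.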

All 2D/3D spectra in our benchmark were folded in at most one dimension and satisfy the aforementioned requirements. However, the 4D CC-NOESY spectra satisfy neither, as regular and folded peaks both overlap and have the same signal amplitude sign. This introduces ambiguity in the spectrum unfolding that prevents direct use of the KDE technique. To retrieve the original signal positions, 4D CC-NOESY cross-peaks were unfolded to overlap with signals detected in 3D 13C-NOESY. Consequently, 4D CC-NOESY unfolding depended on other experiments, and individual 4D cross-peaks were retained only if they were confirmed in a 3D experiment.

Chemical shift assignment

Chemical shift assignment is performed with the existing FLYA algorithm 12, which uses a genetic algorithm combined with local optimization to find an optimal matching between expected and observed peaks. FLYA uses as input the protein sequence, lists of peak positions from the available spectra, chemical shift statistics (either from the BMRB 39 or from the GNN described in the next section), and, if available, the structure from the previous refinement cycle. The tolerance for the matching of peak positions and chemical shifts was set to 0.03 ppm for 1H, and 0.4 ppm for 13C/15N shifts. Each FLYA execution comprises 20 independent runs with identical input data that differ in the random numbers used in the optimization algorithm. Nuclei for which at least 80% of the 20 runs yield, within tolerance, the same chemical shift value are classified as reliably assigned 12 and used as input for the following chemical shift refinement step.
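The 80% consensus criterion can be made concrete with a short Python sketch. It is a simplified stand-in, not FLYA's own consensus procedure: the function name `consensus_shift` and the choice of taking the run value with the most neighbours within tolerance as the consensus are assumptions.

```python
def consensus_shift(run_values, tolerance, min_fraction=0.8):
    """Return (is_reliable, consensus_value) for one nucleus.

    run_values holds the shift assigned to the nucleus in each independent run.
    Assumption for this sketch: the consensus is the run value that has the
    largest number of run values within `tolerance` of it.
    """
    best_value, best_count = None, 0
    for candidate in run_values:
        count = sum(1 for v in run_values if abs(v - candidate) <= tolerance)
        if count > best_count:
            best_value, best_count = candidate, count
    reliable = best_count >= min_fraction * len(run_values)
    return reliable, best_value


# 17 of 20 runs agree on an amide proton shift (tolerance 0.03 ppm for 1H).
agreeing_runs = [8.21] * 17 + [7.50] * 3
reliable, shift = consensus_shift(agreeing_runs, tolerance=0.03)

# Only 15 of 20 runs agree, which is below the 80% threshold.
split_runs = [8.21] * 15 + [7.50] * 5
unreliable, _ = consensus_shift(split_runs, tolerance=0.03)
```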

Chemical shift refinement

We used a graph data structure to combine FLYA-assigned shifts with information from previously assigned proteins (BMRB records) and possible spatial interactions. Each node corresponds to an atom in the protein sequence, and is represented by a feature vector composed of (a) a one-hot encoded atom type code (e.g., Cα, Hβ), (b) a one-hot encoded amino acid type, (c) the value of the chemical shift assigned by FLYA (only if a confident assignment is available, zero otherwise), (d) atom-specific BMRB shift statistics (mean and standard deviation), and (e) 30 chemical shift values obtained from BMRB database fragments. The latter feature is obtained by searching BMRB records for assigned 2–3-residue fragments that match the local protein sequence and have minimal mean-squared error (MSE) to shifts confidently assigned by FLYA (non-zero values of feature (c) in the local neighborhood of the atom). The edges of the graph correspond to chemical bonds or skip connections. The skip connections link the Cβ atom of a given residue with the Cβ atoms 2, 3, and 5 residues apart in the amino acid sequence, and are intended to capture possible through-space influence on the chemical shift that is typically observed in secondary structure elements.
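As a small illustration of the skip connections, the following Python sketch enumerates the residue pairs that would be linked for a short chain. It assumes that every residue contributes a Cβ node (the treatment of glycine is not specified in the text) and that edges are undirected; the function name `cbeta_skip_edges` is hypothetical.

```python
def cbeta_skip_edges(n_residues, offsets=(2, 3, 5)):
    """Return (i, j) residue-index pairs linked by C-beta skip connections.

    Assumptions for this sketch: residues are indexed 0..n_residues-1, every
    residue has a C-beta node, and each undirected edge is listed once (i < j).
    """
    edges = []
    for i in range(n_residues):
        for offset in offsets:
            j = i + offset
            if j < n_residues:
                edges.append((i, j))
    return edges


# Skip connections for a hypothetical 6-residue fragment.
edges = cbeta_skip_edges(6)
```

These edges would be added to the covalent-bond edges of the graph before it is passed to the graph neural network.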

The chemical shift refinement task is defined as a node regression problem, where an expected value of the chemical shift is predicted for each atom that lacks a confident FLYA assignment. This task is addressed with a DeepGCN model 25 , 26 that was trained on 28,400 graphs extracted from 2840 referenced BMRB records 39 . Each training example was created by building a fully assigned graph out of a single BMRB record, and dropping chemical shift values (feature (c) above) for randomly chosen atoms that FLYA typically assigns either with low confidence or inaccurately.

Our DeepGCN model is designed specifically for de novo structure determination, as it uses only the protein sequence and partial shift assignments to estimate values of missing chemical shifts. Its predictions are used to guide the FLYA genetic algorithm optimization 12 by reducing its search range for assignments. The precise final chemical shift value is always determined by the position of a signal in the spectrum, rather than the model prediction alone.

Torsion angle restraints

Before each structure calculation step, torsion angle restraints for the ϕ and ψ angles of the polypeptide backbone were obtained from the current backbone chemical shifts using the program TALOS-N 21 . Restraints were only generated if TALOS-N classified the prediction as ‘Good’, ‘Strong’, or ‘Generous’. Given a TALOS-N torsion angle prediction of ϕ ± Δ ϕ , the allowed range of the torsion angle was set to ϕ ± max(Δ ϕ , 10°) for ‘Good’ and ‘Strong’ predictions, and ϕ ± 1.5 max(Δ ϕ , 10°) for ‘Generous’ predictions, and likewise for ψ .
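The rule for converting a TALOS-N prediction into an allowed torsion angle range can be written as a short Python function. This is a sketch of the rule as stated above; the function name `torsion_restraint` is hypothetical, and wrapping of angles at ±180° is deliberately ignored.

```python
def torsion_restraint(angle, delta, classification):
    """Return the allowed (lower, upper) range in degrees, or None.

    angle and delta are the predicted torsion angle and its uncertainty.
    No restraint is generated for classifications other than 'Good',
    'Strong', or 'Generous'. Angle periodicity is not handled in this sketch.
    """
    half_width = max(delta, 10.0)
    if classification in ("Good", "Strong"):
        pass  # range is angle +/- max(delta, 10 degrees)
    elif classification == "Generous":
        half_width *= 1.5  # range is widened by a factor of 1.5
    else:
        return None
    return angle - half_width, angle + half_width


strong_range = torsion_restraint(-60.0, 8.0, "Strong")
generous_range = torsion_restraint(-60.0, 20.0, "Generous")
no_restraint = torsion_restraint(-60.0, 8.0, "Unclassified")
```

In the first call the uncertainty of 8° is raised to the 10° minimum, whereas in the second call the 20° uncertainty is multiplied by 1.5.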

Structure calculation and selection

Given the chemical shift assignments and NOESY cross-peak positions and intensities, the structure is calculated with CYANA 23 using the established method 22 that comprises 7 cycles of NOESY cross-peak assignment and structure calculation, followed by a final structure calculation. In total, 8 × 100 conformers are calculated for a given input data set using 30,000 torsion angle dynamics steps per conformer. The 20 conformers with the lowest final target function value are chosen to represent the solution structure proposal. The entire combined NOESY assignment and structure calculation procedure is executed independently 10 times based on 10 variants of NOESY peak lists, which differ in the number of cross-peaks selected from the output of the visual spectrum analysis. The first set generously includes all signals selected by ResNet with confidence ≥0.05. The other variants of NOESY peak lists follow the same principle with increasingly restrictive confidence thresholds of 0.1, 0.15, …, 0.5.
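The construction of the 10 NOESY peak list variants can be sketched as follows. The peak representation (a dictionary with a `confidence` key) and the function name `noesy_peak_list_variants` are assumptions made for illustration; only the thresholds 0.05, 0.1, …, 0.5 are taken from the text.

```python
def noesy_peak_list_variants(peaks, n_variants=10, step=0.05):
    """Map each confidence threshold to the peaks that pass it.

    peaks is a list of dicts with a 'confidence' entry (hypothetical layout).
    Thresholds are step, 2*step, ..., n_variants*step, rounded to two decimals
    to avoid floating-point noise such as 0.15000000000000002.
    """
    thresholds = [round(step * k, 2) for k in range(1, n_variants + 1)]
    return {t: [p for p in peaks if p["confidence"] >= t] for t in thresholds}


# Three hypothetical NOESY cross-peaks with ResNet confidence scores.
peaks = [
    {"position": (8.2, 4.1, 120.3), "confidence": 0.07},
    {"position": (7.9, 1.2, 118.6), "confidence": 0.30},
    {"position": (8.5, 2.0, 115.1), "confidence": 0.90},
]
variants = noesy_peak_list_variants(peaks)
```

The most permissive variant keeps all three peaks, while the most restrictive one keeps only the high-confidence peak; each variant would then be passed to an independent structure calculation.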

The CYANA structure calculations are followed by a structure selection step, wherein the 10 intermediate structure proposals are compared pairwise by a Gradient Boosted Tree (GBT) model that uses 96 features from each structure proposal (including the CYANA target function value 23, the number of long-range distance restraints, etc.; for details, see downloads in the Code availability section) to rank the structures by their expected accuracy. The best structure from the ranking is subsequently used as contextual information for the chemical shift refinement cycle (Fig. 1), or returned as the final outcome of ARTINA. The second-best final structure is also returned for comparison.

To train the GBT, we collected a set of successful and unsuccessful structure calculations with CYANA. Each training example was a tuple \(({\boldsymbol{s}}_{i},{r}_{i})\), where \({\boldsymbol{s}}_{i}\) is the vector of features extracted from the CYANA structure calculation output, and \({r}_{i}\) is the RMSD of the output structure to the PDB reference. The GBT was trained to take the features \({\boldsymbol{s}}_{i}\) and \({\boldsymbol{s}}_{j}\) of two structure calculations with CYANA as input, and to predict a binary order variable \({o}_{ij}\), such that \({o}_{ij}=1\) if \({r}_{i} < {r}_{j}\), and 0 otherwise. Importantly, the deposited PDB reference structures were not used directly in the GBT model training (they are used only to calculate the RMSDs). Consequently, the GBT model is unaffected by methodology and technicalities related to PDB deposition (e.g., the structure calculation software used to calculate the deposited reference structure).
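The pairwise formulation can be illustrated with a Python sketch that builds the ordered training pairs and then ranks proposals using any pairwise predictor. Two points are assumptions of this sketch and are not taken from the text: proposals are ranked by counting predicted pairwise wins, and a trivial one-feature comparator stands in for the trained GBT. The function names are hypothetical.

```python
from itertools import permutations


def pairwise_training_set(examples):
    """Build ((s_i, s_j), o_ij) pairs from (features, rmsd) examples.

    o_ij is 1 if structure i has the lower RMSD to the reference, else 0.
    """
    return [
        ((s_i, s_j), 1 if r_i < r_j else 0)
        for (s_i, r_i), (s_j, r_j) in permutations(examples, 2)
    ]


def rank_by_pairwise_wins(feature_vectors, predict_order):
    """Rank proposal indices by the number of predicted pairwise wins.

    predict_order(s_i, s_j) returns 1 if proposal i is predicted to be more
    accurate than proposal j. Win counting is an assumed aggregation rule.
    """
    wins = [0] * len(feature_vectors)
    for i, j in permutations(range(len(feature_vectors)), 2):
        wins[i] += predict_order(feature_vectors[i], feature_vectors[j])
    return sorted(range(len(feature_vectors)), key=lambda i: wins[i], reverse=True)


# Hypothetical examples: one feature per structure and its RMSD to reference.
examples = [((1.0,), 0.8), ((5.0,), 3.2), ((2.0,), 1.5)]
pairs = pairwise_training_set(examples)

# Stand-in for the trained model: a lower first feature is assumed to be better.
ranking = rank_by_pairwise_wins(
    [features for features, _ in examples],
    lambda s_i, s_j: 1 if s_i[0] < s_j[0] else 0,
)
```

The RMSD values enter only through the labels \(o_{ij}\), which mirrors the statement that the reference structures themselves are not model inputs.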

Structure accuracy estimate

As an accuracy estimate for the final ARTINA structure, a predicted RMSD to reference (pRMSD) is calculated from the ARTINA results (without knowledge of the reference PDB structure). It aims at reproducing the actual RMSD to reference, which is the RMSD between the mean coordinates of the ARTINA structure bundle and the mean coordinates of the corresponding reference PDB structure bundle for the backbone atoms N, Cα, C’ in the residue ranges given in Supplementary Table 4. The predicted RMSD is given by pRMSD = (1 – t) × 4 Å, where, in analogy to the GDT_HA value 46, t is the average, over the four cutoffs 0.5, 1, 2, and 4 Å, of the fraction of RMSDs not exceeding the cutoff. These RMSDs are calculated between the mean coordinates of the best ARTINA candidate structure bundle and the mean coordinates of the structure bundles of the 9 other structure proposals. Since t ∈ [0, 1], the pRMSD is always in the range of 0–4 Å, grouping all “bad” structures with expected RMSD to reference ≥ 4 Å at pRMSD = 4 Å.
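The pRMSD formula can be checked with a short Python sketch. It follows the reading that t averages, over the four cutoffs, the fraction of the 9 RMSD values that do not exceed each cutoff; the function name `predicted_rmsd` and the example RMSD values are hypothetical.

```python
def predicted_rmsd(rmsds_to_other_proposals, cutoffs=(0.5, 1.0, 2.0, 4.0)):
    """Return pRMSD = (1 - t) * 4 in Angstrom.

    rmsds_to_other_proposals holds the RMSDs (in Angstrom) between the best
    candidate structure and each of the other structure proposals.
    """
    n = len(rmsds_to_other_proposals)
    fractions = [
        sum(1 for r in rmsds_to_other_proposals if r <= cutoff) / n
        for cutoff in cutoffs
    ]
    t = sum(fractions) / len(cutoffs)
    return (1.0 - t) * 4.0


# All 9 other proposals agree closely with the best one: t = 1.
consistent = predicted_rmsd([0.3] * 9)

# None of the 9 other proposals is within 4 Angstrom: t = 0.
divergent = predicted_rmsd([5.0] * 9)

# Fractions 3/9, 3/9, 6/9, 6/9 give t = 0.5, up to floating-point rounding.
mixed = predicted_rmsd([0.4] * 3 + [1.5] * 3 + [5.0] * 3)
```

The two extreme cases reproduce the bounds of the 0–4 Å range stated above.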

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

Data availability

Reference structures: PDB Protein Data Bank (https://www.rcsb.org/; accession codes in Fig. 2 and Supplementary Table 3).

Spectra and reference assignments: BMRB Biological Magnetic Resonance Data Bank ( https://bmrb.io/ ; entry IDs in Supplementary Table  3 ).

Peak lists, assignments, and structures: https://nmrtist.org/static/public/publications/artina/ARTINA_results.zip and in the ETH Research Collection under DOI 10.3929/ethz-b-000568621.

Source data for Figs.  2 , 4 , and 5 is available in Supplementary Tables  2 , 4 , and 5, respectively.

Code availability

The ARTINA algorithm is available as a webserver at https://nmrtist.org. pp-ResNet, deconv-ResNet, GNN, and GBT are available for download in binary form, together with architecture schemes, example input data, a description of the model inputs, and source code for reading the model files and making predictions (https://github.com/PiotrKlukowski/ARTINA, https://nmrtist.org/static/public/publications/artina/models/ {ARTINA_peak_picking.zip, ARTINA_peak_deconvolution.zip, ARTINA_shift_prediction.zip, ARTINA_structure_ranking.zip}). These files provide a full technical specification of the components developed within ARTINA, and allow for their independent use in Python.

Existing software used: Python ( https://www.python.org/ ), CYANA ( https://www.las.jp/ ), TALOS-N ( https://spin.niddk.nih.gov/bax/software/TALOS-N ).

References

1. Wüthrich, K. NMR studies of structure and function of biological macromolecules (Nobel Lecture). Angew. Chem. Int. Ed. 42, 3340–3363 (2003).
2. Sakakibara, D. et al. Protein structure determination in living cells by in-cell NMR spectroscopy. Nature 458, 102–105 (2009).
3. Guerry, P. & Herrmann, T. Advances in automated NMR protein structure determination. Q. Rev. Biophys. 44, 257–309 (2011).
4. Güntert, P. Automated structure determination from NMR spectra. Eur. Biophys. J. 38, 129–143 (2009).
5. Garrett, D. S., Powers, R., Gronenborn, A. M. & Clore, G. M. A common sense approach to peak picking two-, three- and four-dimensional spectra using automatic computer analysis of contour diagrams. J. Magn. Reson. 95, 214–220 (1991).
6. Koradi, R., Billeter, M., Engeli, M., Güntert, P. & Wüthrich, K. Automated peak picking and peak integration in macromolecular NMR spectra using AUTOPSY. J. Magn. Reson. 135, 288–297 (1998).
7. Würz, J. M. & Güntert, P. Peak picking multidimensional NMR spectra with the contour geometry based algorithm CYPICK. J. Biomol. NMR 67, 63–76 (2017).
8. Klukowski, P. et al. NMRNet: A deep learning approach to automated peak picking of protein NMR spectra. Bioinformatics 34, 2590–2597 (2018).
9. Li, D. W., Hansen, A. L., Yuan, C. H., Bruschweiler-Li, L. & Brüschweiler, R. DEEP picker is a deep neural network for accurate deconvolution of complex two-dimensional NMR spectra. Nat. Commun. 12, 5229 (2021).
10. Bartels, C., Güntert, P., Billeter, M. & Wüthrich, K. GARANT: A general algorithm for resonance assignment of multidimensional nuclear magnetic resonance spectra. J. Comput. Chem. 18, 139–149 (1997).
11. Zimmerman, D. E. et al. Automated analysis of protein NMR assignments using methods from artificial intelligence. J. Mol. Biol. 269, 592–610 (1997).
12. Schmidt, E. & Güntert, P. A new algorithm for reliable and general NMR resonance assignment. J. Am. Chem. Soc. 134, 12817–12829 (2012).
13. Linge, J. P., O’Donoghue, S. I. & Nilges, M. Automated assignment of ambiguous nuclear overhauser effects with ARIA. Methods Enzymol. 339, 71–90 (2001).
14. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002).
15. Allain, F., Mareuil, F., Ménager, H., Nilges, M. & Bardiaux, B. ARIAweb: a server for automated NMR structure calculation. Nucleic Acids Res. 48, W41–W47 (2020).
16. Lee, W. et al. I-PINE web server: an integrative probabilistic NMR assignment system for proteins. J. Biomol. NMR 73, 213–222 (2019).
17. Huang, Y. P. J. et al. An integrated platform for automated analysis of protein NMR structures. Methods Enzymol. 394, 111–141 (2005).
18. Kobayashi, N. et al. KUJIRA, a package of integrated modules for systematic and interactive analysis of NMR data directed to high-throughput NMR structure studies. J. Biomol. NMR 39, 31–52 (2007).
19. López-Méndez, B. & Güntert, P. Automated protein structure determination from NMR spectra. J. Am. Chem. Soc. 128, 13112–13122 (2006).
20. Murphy, K. P. Probabilistic Machine Learning: An Introduction (MIT Press, 2022).
21. Shen, Y. & Bax, A. Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks. J. Biomol. NMR 56, 227–241 (2013).
22. Güntert, P. & Buchner, L. Combined automated NOE assignment and structure calculation with CYANA. J. Biomol. NMR 62, 453–471 (2015).
23. Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997).
24. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
25. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arxiv.org/abs/1609.02907 (2016).
26. Chiang, W. L. et al. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD) 257–266 (2019).
27. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proc. 32nd Conference on Neural Information Processing Systems (NIPS) (2018).
28. Rosato, A. et al. The second round of Critical Assessment of Automated Structure Determination of Proteins by NMR: CASD-NMR-2013. J. Biomol. NMR 62, 413–424 (2015).
29. Kirchner, D. K. & Güntert, P. Objective identification of residue ranges for the superposition of protein structures. BMC Bioinform. 12, 170 (2011).
30. Buchner, L. & Güntert, P. Systematic evaluation of combined automated NOE assignment and structure calculation with CYANA. J. Biomol. NMR 62, 81–95 (2015).
31. Fowler, N. J., Sljoka, A. & Williamson, M. P. A method for validating the accuracy of NMR protein structures. Nat. Commun. 11, 6321 (2020).
32. Huang, Y. J., Powers, R. & Montelione, G. T. Protein NMR recall, precision, and F-measure scores (RPF scores): Structure quality assessment measures based on information retrieval statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005).
33. Buchner, L. & Güntert, P. Increased reliability of nuclear magnetic resonance protein structures by consensus structure bundles. Structure 23, 425–434 (2015).
34. Koradi, R., Billeter, M. & Güntert, P. Point-centered domain decomposition for parallel molecular dynamics simulation. Comput. Phys. Commun. 124, 139–147 (2000).
35. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE-identification in the NOESY spectra using the new software ATNOS. J. Biomol. NMR 24, 171–189 (2002).
36. Buchner, L., Schmidt, E. & Güntert, P. Peakmatch: A simple and robust method for peak list matching. J. Biomol. NMR 55, 267–277 (2013).
37. Scott, A., López-Méndez, B. & Güntert, P. Fully automated structure determinations of the Fes SH2 domain using different sets of NMR spectra. Magn. Reson. Chem. 44, S83–S88 (2006).
38. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
39. Ulrich, E. L. et al. BioMagResBank. Nucleic Acids Res. 36, D402–D408 (2008).
40. Goddard, T. D. & Kneller, D. G. Sparky 3 (University of California, San Francisco, 2001).
41. Delaglio, F. et al. NMRPipe: A multidimensional spectral processing system based on Unix pipes. J. Biomol. NMR 6, 277–293 (1995).
42. Bartels, C., Xia, T. H., Billeter, M., Güntert, P. & Wüthrich, K. The program XEASY for computer-supported NMR spectral analysis of biological macromolecules. J. Biomol. NMR 6, 1–10 (1995).
43. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. Proc. Mach. Learn. Res. 9, 249–256 (2010).
44. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2015).
45. Davies, E. R. Computer Vision (Academic Press, 2018).
46. Kryshtafovych, A. et al. New tools and expanded data analysis capabilities at the protein structure prediction center. Proteins 69, 19–26 (2007).

Acknowledgements

We thank Drs. Frédéric Allain, Fred Damberger, Hideo Iwai, Harindranath Kadavath, Julien Orts, and Dean Strotz for providing unpublished spectra. This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 891690 (P.K.), and a Grant-in-Aid for Scientific Research of the Japan Society for the Promotion of Science (P.G., 20K06508).

Author information

Authors and affiliations.

Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093, Zurich, Switzerland

Piotr Klukowski, Roland Riek & Peter Güntert

Institute of Biophysical Chemistry, Goethe University Frankfurt, Max-von-Laue-Str. 9, 60438, Frankfurt am Main, Germany

  • Peter Güntert

Department of Chemistry, Tokyo Metropolitan University, 1-1 Minami-Osawa, Hachioji, 192-0397, Tokyo, Japan

Contributions

P.K. prepared training and test data sets, designed and trained machine learning models, performed experiments described in the manuscript, and implemented ARTINA within the nmrtist.org web platform. P.K. and P.G. wrote the software. P.K., R.R., and P.G. conceived the project, analyzed the results, and wrote the manuscript.

Corresponding authors

Correspondence to Piotr Klukowski , Roland Riek or Peter Güntert .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Benjamin Bardiaux, Gaetano Montelione, Theresa Ramelot, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.  Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information file, description of additional supplementary files, Supplementary Movies 1–5, reporting summary, peer review file.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Klukowski, P., Riek, R. & Güntert, P. Rapid protein assignments and structures from raw NMR spectra with the deep learning technique ARTINA. Nat Commun 13 , 6151 (2022). https://doi.org/10.1038/s41467-022-33879-5

Received : 28 March 2022

Accepted : 30 September 2022

Published : 18 October 2022

DOI : https://doi.org/10.1038/s41467-022-33879-5

Chemical shift-based methods in NMR structure determination

Santrupti Nerli

a Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States

b Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA 95064, United States

Andrew C. McShan

Nikolaos G. Sgourakis

Chemical shifts are highly sensitive probes harnessed by NMR spectroscopists and structural biologists as conformational parameters to characterize a range of biological molecules. Traditionally, assignment of chemical shifts has been a labor-intensive process requiring numerous samples and a suite of multidimensional experiments. Over the past two decades, the development of complementary computational approaches has bolstered the analysis, interpretation and utilization of chemical shifts for elucidation of high resolution protein and nucleic acid structures. Here, we review the development and application of chemical shift-based methods for structure determination with a focus on ab initio fragment assembly, comparative modeling, oligomeric systems, and automated assignment methods. Throughout our discussion, we point out practical uses, as well as advantages and caveats, of using chemical shifts in structure modeling. We additionally highlight (i) hybrid methods that employ chemical shifts with other types of NMR restraints (residual dipolar couplings, paramagnetic relaxation enhancements and pseudocontact shifts) that allow for improved accuracy and resolution of generated 3D structures, (ii) the utilization of chemical shifts to model the structures of sparsely populated excited states, and (iii) modeling of side-chain conformations. Finally, we briefly discuss the advantages of contemporary methods that employ sparse NMR data recorded using site-specific isotope labeling schemes for chemical shift-driven structure determination of larger molecules. With this review, we aim to emphasize the accessibility and versatility of chemical shifts for structure determination of challenging biological systems, and to point out emerging areas of development that lead us towards the next generation of tools.

1. Introduction

Interpretation of chemical shifts serves as the primary work-horse for both solution- and solid-state nuclear magnetic resonance (NMR) studies of biological systems. Chemical shifts are highly reproducible, sensitive parameters with far-reaching utility in characterizing the structure and dynamics of a diverse range of biomolecules which carry out important cellular functions. When fully deciphered, chemical shifts report on the local magnetic environment of nuclei, allowing for insights into backbone secondary structure, sidechain conformations, dynamics, solvation, and hydrogen bonding [ 1 – 4 ]. In recent years, the notion that chemical shifts can be used to streamline the NMR structure determination process has been supported by landmark methods to determine high resolution structures solely from chemical shift data [ 4 – 6 ]. Many of these methods aim to circumvent the need to acquire and analyze additional restraints, typically provided by nuclear Overhauser effect (NOE), residual dipolar coupling (RDC) or paramagnetic relaxation enhancement (PRE) measurements [ 7 ]. Moreover, the majority of chemical shift-based structure determination methods can be easily integrated with complementary classical approaches, where backbone resonance assignment is performed through standard triple-resonance experiments, and sidechain resonances are assigned through TOCSY (Total Correlation Spectroscopy)- and COSY (Correlation Spectroscopy)-type experiments [ 8 , 9 ]. More recent advances in chemical shift-based methods stem from the combination of new, highly sensitive experiments and robust computational algorithms required for analysis of complex NMR datasets [ 5 ].

Currently, there are more than 119,000 and 12,000 3D structures in the Protein Data Bank (PDB) [ 10 ] solved by X-ray crystallography and NMR spectroscopy, respectively. Notwithstanding the limitations in size and dynamic complexity, NMR methods allow for high resolution studies of macromolecules in their functionally relevant, aqueous environment [ 11 ]. Typically, a diverse set of parameters, including chemical shifts, RDC, and NOE restraints are combined to yield NMR structural ensembles showing various levels of precision and accuracy [ 12 – 14 ]. The entire process involves a series of time-consuming steps, such as peak picking, chemical shift and NOE cross-peak assignment, structure calculation, validation and refinement [ 15 ]. Thus, while structure determination by X-ray crystallography is relatively streamlined towards high throughput applications, structure determination by NMR remains a lengthy, labor-intensive process where the main bottlenecks (assignment of backbone/sidechain resonances and NOE cross-peaks) would greatly benefit from automated procedures [ 16 ]. Moreover, conventional NMR structure determination methods rely on a large number of complementary multidimensional datasets and abundant restraint densities (of the order of 15–20 restraints per residue [ 17 ]) to obtain reliable, high resolution structures. Thus, the use of automated algorithms for chemical shift assignment [ 7 , 18 ] and NOE cross-peak assignment [ 7 , 15 ] is an attractive avenue to facilitate the structure determination process. Here, protocols which utilize chemical shifts to drive structure modeling can have a significant impact on the quality and efficiency of automated assignments.

Alongside the development of sophisticated NMR methods, remarkable progress has been achieved in the field of protein and nucleic acid structure prediction. In particular, the application of algorithms that make use of information from known structures in the PDB, together with physically realistic energy functions, have enabled modeling biomolecular structures with reasonable accuracy [ 19 , 20 ]. However, modeling structures exclusively from their sequence is extremely challenging and limited in scope to smaller systems as highlighted by the CASP (Critical Assessment of Techniques for Protein Structure Prediction) initiative [ 21 , 22 ]. Concurrently, the combination of computational methods together with sparse experimental data from a variety of techniques has provided a new paradigm for medium to low-resolution structural studies of macromolecular assemblies and other complex systems [ 23 , 24 ]. The integration of such methods with chemical shift data offers an opportunity for further automation of the NMR structure determination process. Thus, during the past decade, a large number of studies have highlighted the great promise held by this new approach towards studies of monomeric proteins, protein complexes and other biomolecules spanning a range of sizes and dynamic complexities ( Fig. 1 ). In this review, we discuss the development of NMR chemical shift-based methods with a focus on de novo structure determination of proteins in the solution-state ( Table 1 ). Specifically, we illustrate how the implementation of automated methods that take advantage of chemical shift data, either exclusively or in combination with additional experimental restraints, allows for accurate structure determination in a range of applications.

Fig. 1. Progress in structure determination of biological molecules utilizing NMR chemical shifts. De novo structures determined using sequence-based ( ab initio , NK-Lysin and Calbindin) or chemical shift-based (Interleukin-1β, SEN15, NeR45A, TCTP, ALG13, PSE-4 β-lactamase, MBP, Sc.ai5γ RNA) methods. Molecular weight (y-axis, kDa) is shown as a function of time (x-axis, year). The examples shown are Calbindin [ 58 ], NK-Lysin [ 57 ], Interleukin-1β [ 64 ], Sc.ai5γ RNA [ 210 ], SEN15 [ 59 ], NeR45A [ 60 ], TCTP [ 214 ], ALG13 [ 82 ], PSE-4 β-lactamase [ 81 ] and MBP [ 87 , 109 , 128 ]. PDB IDs and methods used are given for each example.

Table 1. Evolution of methods (a subset mentioned in this review) that have influenced the development of NMR chemical shift-based structure determination.

| Year | Description (method) | Reference |
|------|----------------------|-----------|
| 1993 | NOE assignment and structure refinement approach (ASNO) | [ ] |
| 1995 | Introduction of ambiguous NOE distance restraints | [ ] |
| 1997 | Protein structure determination methods using fragment assembly ( ) | [ , ] |
| 1997 | NMR structure calculation using ambiguous distance restraints in ARIA (Ambiguous Restraints for Iterative Assignment) | [ , ] |
| 1997 | Automated H and C chemical shift prediction (SHIFTY) | [ ] |
| 1997 | Prediction of DNA chemical shifts (NUCHEMICS) | [ ] |
| 1999 | Prediction of backbone torsion angles using chemical shifts (TALOS) | [ ] |
| 2000 | Structure prediction using MFR and NMR dipolar couplings | [ ] |
| 2001 | Prediction of N, Cα, Cβ, and C′ chemical shifts using density functional database (SHIFTS) | [ ] |
| 2001 | Prediction of RNA chemical shifts (NUCHEMICS) | [ ] |
| 2003 | NMR chemical shift prediction using artificial neural networks (PROSHIFT) | [ ] |
| 2003 | Calculation of H, C, and N chemical shifts (SHIFTX) | [ ] |
| 2003 | Protein-protein docking using ambiguous NMR distance restraints obtained from chemical shift perturbation (HADDOCK) | [ ] |
| 2003 | Macromolecular structure determination package (Xplor-NIH) | [ ] |
| 2005 | MFR approach to determine structures using chemical shifts and dipolar coupling homology (MFR+) | [ ] |
| 2006 | Protein structure prediction using NMR chemical shifts and unassigned NOESY data (ASDP) | [ ] |
| 2007 | Yet another chemical shift-based structure prediction method (CHESHIRE) | [ ] |
| 2007 | Chemical shift prediction from torsion angles and sequence homology (SPARTA) | [ ] |
| 2008 | Blind protein structure determination using NMR chemical shifts (CS-Rosetta) | [ ] |
| 2008 | A flexible protein-protein docking method guided by NMR chemical shifts (CamDock) | [ ] |
| 2009 | Determination of NMR chemical shifts from interatomic distances (CamShift) | [ ] |
| 2009 | Determination of homo-oligomeric complexes using chemical shifts with docking protocol | [ ] |
| 2009 | Structure determination using chemical shift restraints and Monte Carlo simulations | [ ] |
| 2009 | Hybrid method for backbone torsion angle prediction from NMR chemical shifts (TALOS+) | [ ] |
| 2010 | Structure determination using backbone chemical shifts and RDC data from iterative CS-Rosetta (CS-RDC-Rosetta) | [ ] |
| 2010 | MD simulations of proteins using NMR chemical shifts (CS-MD) | [ ] |
| 2010 | Improved chemical shift prediction using artificial neural networks (SPARTA+) | [ ] |
| 2010 | A new web server for protein-protein docking using data derived from NMR (HADDOCK Web Server) | [ ] |
| 2010 | Xaa-Pro peptide bond conformation prediction using chemical shifts and amino acid sequence (Proline Omega angle prediction (PROMEGA)) | [ ] |
| 2011 | New and improved chemical shift predictor (SHIFTX2) | [ ] |
| 2011 | Algorithm for modeling oligomers from chemical shifts and dipolar couplings (RosettaOligomers) | [ ] |
| 2011 | Modeling protein complexes using NMR chemical shifts (CHESHIRE/CamDock) | [ ] |
| 2011 | Applying pseudocontact shifts to perform protein-protein docking using HADDOCK | [ ] |
| 2011 | Chemical shift prediction of methyl groups (CH3SHIFT) | [ ] |
| 2012 | A new molecular fragment replacement method that uses protein docking algorithms to perform fragment assembly | [ ] |
| 2012 | Improved sampling in protein structure calculation guided by NMR restraints (RASREC-Rosetta) | [ ] |
| 2012 | Structure determination of large proteins using sparse NMR data collected from deuterated samples (RASREC-Rosetta) | [ ] |
| 2012 | Utilizing pseudocontact shifts to guide protein structure prediction (PCS-Rosetta) | [ ] |
| 2012 | Accurate structure modeling using sparse NMR data and restraints derived from homologous proteins (CS-HM-Rosetta) | [ ] |
| 2013 | Utilizing backbone amide pseudocontact shifts generated from paramagnetic tags to determine protein structures (GPS-Rosetta) | [ ] |
| 2013 | Backbone and sidechain torsion angle prediction using artificial neural networks (TALOS-N) | [ ] |
| 2013 | Applying proton chemical shifts to determine helical structures of nucleic acids (Chemical Shift Structure Derivation Protocol Employing Singular Value Decomposition (CHEOPS)) | [ ] |
| 2014 | A new automated NOE assignment algorithm and structure calculation with improved sampling in Rosetta (AutoNOE-Rosetta) | [ ] |
| 2014 | Using proton chemical shifts to predict non-canonical RNA motifs (CS-Rosetta-RNA) | [ ] |
| 2015 | Modeling of protein structures using chemical shift homology (CS-RosettaCM/Pomona) | [ ] |
| 2015 | Structure modeling using NMR data and evolutionary coupling restraints (EC-NMR) | [ ] |
| 2015 | Protein structure determination using sparse NMR data and evolutionary information (RASREC-Rosetta with evolutionary restraints) | [ ] |
| 2016 | HADDOCK2.2 web server for modeling protein complexes using NMR data | [ ] |
| 2016 | Protein structure determination using solvent accessible surface area from paramagnetic relaxation enhancement measurements (sPRE-CS-Rosetta) | [ ] |
| 2017 | Backbone and sidechain resonance assignment method using two 4D spectra | [ ] |

2. Modeling backbone torsion angles from chemical shifts

The strong dependence of isotropic chemical shifts on the local backbone geometry has motivated the development of methods to determine protein torsion (or dihedral) angles from a basic set of shifts (reviewed in [ 2 ]). The backbone and sidechain dihedral angles define the local conformation of a polypeptide chain, thus governing secondary structure, sidechain packing, and overall tertiary/quaternary folds. As a result, determination of peptide backbone (ϕ, ψ and ω) and sidechain (χ i ) dihedral angles directly from chemical shifts provides valuable restraints during structure calculation and refinement [ 25 – 27 ], especially in the absence of NOE restraints. While torsion angles can be directly measured through scalar J-couplings [ 28 – 30 ] or dipole-dipole and dipole-chemical shift anisotropy (CSA) cross-correlated relaxation [ 31 , 32 ], these methods are less applicable to larger proteins, where NMR resonances undergo significant line-broadening and a reduction in signal-to-noise, both of which limit the applications of these experiments [ 33 ].

Whereas DFT (Density Functional Theory)-based calculations can provide valuable insights into the dependence of chemical shift values on local geometry for different nuclei (such as 1 H α , 15 N, 13 C α , 13 C β , 13 Cʹ), empirical approaches have generally been more successful in directly modeling the local backbone structure [ 4 ]. Here, torsion angles are predicted based on amino acid sequence and chemical shift similarity relative to a curated database of assigned chemical shifts/protein structure pairs derived from the PDB and Biological Magnetic Resonance Bank (BMRB) [ 34 ]. Several methods stemming from this approach have already been reviewed in [ 4 , 5 , 7 ]. As expected, the accuracy of these chemical shift-based methods is higher (>73% of the torsion angle predictions lie within 30° from corresponding angles in the reference structures) compared to approaches that exclusively use sequence information (>65% of the predictions lie within 36° from the reference angles) [ 35 ]. Larger databases, together with more consistent referencing of chemical shift values are expected to improve the accuracy and precision of torsion angle predictions even further [ 36 , 37 ]. In addition to torsion angles, alternative methods use chemical shift data to elucidate secondary structures [ 4 , 5 , 7 ] and less frequently occurring Xaa-Pro peptide bond conformations [ 38 ]. While all these methods provide local restraints, which can be applied directly in structure calculations, the majority of NMR structure determination programs are supplemented with additional sources of NMR data in order to obtain highly converged models.
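The accuracy figures quoted above (a fraction of predictions lying within 30° of the reference angles) depend on comparing angles on a circle, so that 175° and -178° count as 7° apart rather than 353°. A minimal sketch of such a tally is given below; the function names and the example angles are illustrative assumptions and are not taken from any of the cited programs.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def fraction_within(predicted, reference, tolerance_deg=30.0):
    """Fraction of predicted torsion angles within tolerance of the reference.

    Both arguments are flat lists of angles in degrees (hypothetical layout:
    one entry per predicted phi or psi value).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1
        for p, r in zip(predicted, reference)
        if angular_difference(p, r) <= tolerance_deg
    )
    return hits / len(predicted)


# Made-up angles purely for illustration.
example_predicted = [-60.0, -45.0, 175.0, -120.0]
example_reference = [-65.0, -40.0, -178.0, -60.0]
example_fraction = fraction_within(example_predicted, example_reference)
```

Whether a residue is counted once (both ϕ and ψ correct) or each angle is counted separately changes the reported percentage, so the convention should be stated alongside any such number.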

3. Predicting chemical shifts from known structures

Alongside the efforts to derive structural restraints from chemical shifts, prediction of the chemical shifts of known structures is an active field of research where a variety of sequence-based, structure-based and hybrid approaches have been developed [ 4 , 5 ]. Accurate chemical shift predictions from an available X-ray structure can actively aid in making chemical shift assignments, as well as in structure modeling and validation [ 5 , 39 ]. Analogous to dihedral angle prediction methods discussed in Section 2, sequence-based chemical shift prediction methods are based on the concept that sequence similarity often results in local structure and chemical shift homology. This idea forms the basis of SHIFTY [ 40 ], an early method that is able to predict 1 H and 13 C backbone chemical shifts with a Pearson’s correlation coefficient (between experimental and predicted) >0.85 for all atoms (proton and carbon) when the query protein shares >35% sequence identity to a reference structure with known chemical shift assignments available in the BMRB. Following SHIFTY, there have been a number of methods (extensively reviewed in [ 4 , 5 , 7 ]), which can predict chemical shifts of up to 40 atom types in less than a few seconds per residue. The correlation coefficients of a few of these methods (SHIFTX [ 41 ], SPARTA (Shifts Predicted from Analogy in Residue type and Torsion Angle) [ 42 ], SPARTA+ [ 43 ], CamShift [ 44 ], SHIFTS [ 45 ], PROSHIFT [ 46 ] and SHIFTX2 [ 47 ]) range from 0.7 to 0.99 for 15 N, 13 C α , 13 C β , 13 Cʹ, 1 H N , and 1 H α backbone atoms; arguably, an exception is 1 H N , where SHIFTX2 (correlation coefficient of 0.97) outperforms other methods by a large margin (correlation coefficients of other methods lie between 0.51 and 0.71), when tested on a benchmark set of 61 proteins [ 47 ]. SHIFTX/SHIFTX2 and SPARTA/SPARTA+ are comparable in performance and are widely used due to their speed and accuracy. 
The practical utility of some of these methods in de novo structure determination is discussed throughout this manuscript.
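The correlation coefficients quoted in this section compare experimental and predicted shifts for one nucleus type at a time. As a rough illustration of how such a number is obtained, the sketch below computes Pearson's r in plain Python; the shift values are invented for the example and do not come from any benchmark.

```python
import math


def pearson_r(experimental, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, predicted))
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical 13C-alpha shifts (ppm) for five residues, for illustration only.
exp_ca = [58.2, 55.1, 62.4, 52.9, 60.3]
pred_ca = [57.9, 55.6, 61.8, 53.5, 60.0]
r_ca = pearson_r(exp_ca, pred_ca)
```

A high correlation coefficient can hide a systematic offset (for example, from inconsistent referencing), so it is usually read together with an RMS error between the two sets of shifts.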

4.  De novo structure determination from chemical shift data

Structure prediction methods have shown great success for small to medium sized proteins (<150 residues) using various strategies, including ab initio [ 48 , 49 ], comparative modeling [ 50 ], fold prediction and threading [ 51 ]. However, de novo modeling of larger proteins remains a challenging problem owing to the number of feasible solutions to the conformational search problem [ 52 ]. In spite of the computational complexities involved in ab initio structure determination, there has been significant progress in the development of sophisticated methods in the past two decades. Rosetta [ 53 ], QUARK [ 54 ] and I-TASSER (Iterative Threading Assembly Refinement) [ 55 ] are a few software packages that have been widely applied to construct 3D structural models starting only from a query amino acid sequence. These methods are particularly useful when no homologs with known structures can be identified in the PDB, which is often the case for larger proteins [ 48 ].

Bowie and Eisenberg pioneered the field of ab initio prediction with their concept of generating protein models from an assembly of short, overlapping backbone fragments derived from a structural database [ 56 ], and this idea laid the foundation for several early implementations of ab initio methods [ 57 , 58 ]. In these methods, the selection of fragments from a high resolution protein structure database is based on sequence or secondary structure homology. Following selection, fragments are assembled using Monte Carlo-simulated annealing methods that minimize physically realistic energy functions to produce 3D structural models. Although these methods can produce low-energy models exhibiting the native fold for small proteins (<100 residues), larger targets pose significant challenges due to the quality of fragments used for assembly and the exponential increase in the conformational search space. In order to attempt to overcome the drawbacks of these early ab initio methods, several protocols that exploit NMR chemical shifts have emerged (reviewed in [ 7 ]). A great majority of these methods employ the generalized fragment assembly framework ( Fig. 2 ). Here, sequence and chemical shifts are used to derive local structural features, such as torsion angle restraints and secondary structure information, which further guide the fragment selection from a database of high resolution X-ray structures. The selected fragments are then used to build low-resolution models starting from a fully extended protein chain, characterized by bond lengths, bond angles, and backbone torsion angles. Here, bond lengths and angles are typically fixed to ideal values and the peptide bond is assumed to be planar; therefore, it is the backbone torsion angles (ϕ and ψ) that effectively define the conformation of a protein chain [ 59 , 60 ]. 
This reduction in the degrees of freedom from Cartesian to torsion angle space greatly boosts the performance of a search towards the native conformation using Monte Carlo-based optimization methods. Lastly, sidechain rotamers [ 61 ] and minor deviations from ideal values are introduced on low-resolution conformations, which undergo further refinement to reduce steric clashes, and finally to produce all-atom structural models.
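The fragment insertion step described above can be sketched as a toy Metropolis Monte Carlo search in torsion angle space. Everything in this sketch is an assumption made for illustration: the chain is a list of (ϕ, ψ) pairs, the fragment library holds only idealised helix and strand triplets, the energy is a made-up penalty that favours helical torsions, and the temperature is held fixed rather than annealed. It is not the scoring function or sampling schedule of any program discussed here.

```python
import math
import random

EXTENDED = (180.0, 180.0)   # (phi, psi) of a fully extended residue
HELIX = (-60.0, -45.0)      # idealised helical torsions
STRAND = (-120.0, 130.0)    # idealised strand torsions


def angle_gap(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def toy_energy(chain):
    """Hypothetical stand-in for a low-resolution energy: prefers helix."""
    return sum(
        angle_gap(phi, HELIX[0]) + angle_gap(psi, HELIX[1]) for phi, psi in chain
    ) / 100.0


def assemble(n_residues, library, energy, steps=2000, temperature=1.0, seed=0):
    """Metropolis fragment insertion starting from an extended chain.

    library maps a 0-based start position to a list of candidate fragments,
    each a list of (phi, psi) pairs. Returns the best chain and its energy.
    """
    rng = random.Random(seed)
    chain = [EXTENDED] * n_residues
    current = energy(chain)
    best_chain, best = list(chain), current
    starts = sorted(library)
    for _ in range(steps):
        start = rng.choice(starts)
        fragment = rng.choice(library[start])
        # Overwrite the torsions of the window covered by the fragment.
        trial = chain[:start] + list(fragment) + chain[start + len(fragment):]
        trial_energy = energy(trial)
        delta = trial_energy - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            chain, current = trial, trial_energy
            if current < best:
                best_chain, best = list(chain), current
    return best_chain, best


n = 9
toy_library = {s: [[HELIX] * 3, [STRAND] * 3] for s in range(n - 2)}
final_chain, final_energy = assemble(n, toy_library, toy_energy)
```

Because each move replaces only a handful of torsion angles, the search explores 2 values per residue instead of 3 Cartesian coordinates per atom, which is the reduction in degrees of freedom referred to above.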

Fig. 2. General pipeline for de novo structure determination using fragment assembly. Backbone fragments are first generated from high resolution structures obtained from a curated database derived from the PDB. Fragments are then ranked according to primary amino acid sequence information and/or chemical shift-based torsion angle predictions. The assembly of selected fragments generates low-resolution models, which are iteratively refined using a physically relevant energy function to yield the final structures.

An early fragment assembly method (Molecular Fragment Replacement or MFR) utilizes experimental chemical shifts and dipolar couplings to model low-resolution structures [ 62 ]. Akin to “Molecular Replacement” methods, widely used in X-ray crystallography refinement, this approach is inspired by previous work that determined local structural fragments using sparse NOE data [ 63 ]. Specifically, MFR performs a pairwise search of a fragment database where the best candidates are selected by a χ² test that evaluates the difference between (i) measured and calculated dipolar couplings from a singular value decomposition procedure (dipolar homology) and (ii) experimental and predicted chemical shift values for each selected fragment. The well-fitting fragments provide backbone torsion angle restraints that are applied during low-resolution structure modeling. Finally, the predicted models are further refined in order to improve their agreement with experimental chemical shifts and dipolar couplings. The utility of MFR is highlighted by a measured backbone RMSD (Root Mean Square Deviation) of 1.2 Å (angstrom) between modelled and X-ray structures of ubiquitin [ 62 ], suggesting that folds for small proteins can be captured using solely the chemical shifts and dipolar couplings, thereby alleviating the need to acquire and analyze NOESY data. Further improvements have been made to this algorithm at various stages including fragment search, assembly, sidechain placement, and structure refinement by employing other NMR parameters, such as J-couplings and NOEs [ 64 , 65 ]. While the early MFR method could accurately model backbone structures of small proteins, a significant limitation remained with respect to sidechain placement [ 62 ], which has been addressed in more recent methods [ 64 , 65 ].
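The fragment selection step can be pictured as ranking candidates by a combined χ² score over dipolar couplings and chemical shifts. The sketch below is a simplified illustration under stated assumptions: the calculated couplings are taken as already available for each fragment (in MFR they come from a singular value decomposition fit, which is not reproduced here), the two χ² terms are simply added, and the uncertainties `sigma_rdc` and `sigma_cs` are hypothetical parameters.

```python
def chi_square(observed, calculated, sigma):
    """Sum of squared, uncertainty-scaled differences between two sequences."""
    if len(observed) != len(calculated):
        raise ValueError("observed and calculated must have equal length")
    return sum(((o - c) / sigma) ** 2 for o, c in zip(observed, calculated))


def rank_fragments(candidates, measured_rdc, measured_cs,
                   sigma_rdc=1.0, sigma_cs=1.0):
    """Return (score, name) pairs sorted from best to worst fit.

    candidates is a list of (name, calculated_rdc, predicted_cs) tuples;
    this layout is an assumption made for the example.
    """
    scored = []
    for name, calc_rdc, pred_cs in candidates:
        score = (chi_square(measured_rdc, calc_rdc, sigma_rdc)
                 + chi_square(measured_cs, pred_cs, sigma_cs))
        scored.append((score, name))
    return sorted(scored)


# Invented numbers: two couplings (Hz) and two shifts (ppm) per fragment.
measured_rdc = [10.0, -5.0]
measured_cs = [58.0, 8.2]
candidates = [
    ("fragment_A", [9.0, -4.0], [57.5, 8.0]),
    ("fragment_B", [2.0, 3.0], [55.0, 7.1]),
]
ranking = rank_fragments(candidates, measured_rdc, measured_cs)
```

In practice the two data types have very different scales and uncertainties, so the choice of σ values effectively sets their relative weight in the ranking.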

One of the methods that surpassed previously existing approaches in structure prediction accuracy, addressed known limitations and worked well for a wide range of molecular weights is CHESHIRE (Chemical Shift Restraints) [ 59 ]. The CHESHIRE procedure further extends the fragment-based strategy introduced by MFR [ 62 ] together with NMR chemical shifts to model protein structures. This procedure consists of three phases that follow a generalized fragment assembly framework ( Fig. 2 ). First, the 3PRED algorithm [ 59 ] is used to predict secondary structures of three- and nine-residue fragments using NMR chemical shifts in conjunction with sequence-based secondary structure propensities. Specifically, the experimental chemical shifts are used to estimate the probability of an amino acid adopting a given secondary structure type. Additionally, secondary structure propensities are computed from known structures in the ASTRAL Structural Classification of Proteins (SCOP) database [ 66 ] according to a classification performed by the STRIDE (Structural Identification) algorithm [ 67 ]. Second, the TOPOS algorithm [ 59 ] similar to TALOS (Torsion Angle Likelihood Obtained from Shift and sequence similarity) [ 68 ] is used to predict backbone torsion angles using combined information drawn from experimental chemical shifts and previously determined secondary structural elements. The primary difference between TOPOS and TALOS lies in how they use chemical shifts for scoring (for instance, TOPOS ignores 1 H N chemical shifts). Following torsion angle prediction, candidate fragment conformations are selected from a structural database and filtered according to an energy function consisting of empirical terms sensitive to torsion angles, secondary structure and agreement between experimental and back-calculated chemical shifts (computed via SHIFTX). 
Third, fragments are assembled to generate low-resolution models using a Monte Carlo-simulated annealing method, where the query protein chain adopts a simplified representation consisting of only the backbone atoms and C β atoms of sidechain groups. Finally, sidechain rotamers from the Dunbrack library [ 61 ] are added to the low-resolution models following optimization of an all-atom energy function using Monte Carlo-based techniques. The all-atom energy function ( E ) contains a chemical shift term (back-calculated from the predicted models using SHIFTX) and a molecular dynamics (MD) force field according to Eq. (1) :

Here, the numerator recapitulates an MD-derived force field containing terms from van der Waals, electrostatic, solvent, pair-wise mean force and hydrogen bonding effects [ 59 ]. The denominator is an experimental scoring function, where C_X measures the correlation between experimental and back-calculated chemical shifts using SHIFTX for X ∈ {Hα, N, Cα, Cβ} atoms. The corresponding k values are user-defined constants.
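Written out from the description above, Eq. (1) has the general shape of a force-field energy divided by a chemical shift agreement score. The block below is a sketch of that shape only: the subscript names of the individual force-field terms and the assumption that the denominator is a plain weighted sum of the correlation terms are ours, and the exact expression should be taken from [ 59 ].

```latex
E \;=\; \frac{E_{\mathrm{vdW}} + E_{\mathrm{elec}} + E_{\mathrm{solv}} + E_{\mathrm{PMF}} + E_{\mathrm{HB}}}
             {\displaystyle\sum_{X \in \{H^{\alpha},\, N,\, C^{\alpha},\, C^{\beta}\}} k_{X}\, C_{X}}
```

With this arrangement, better agreement with the experimental shifts (larger C_X) scales down the magnitude of the force-field energy, so the chemical shift term acts as a reweighting of the force field rather than as an additive restraint.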

In the original implementation of CHESHIRE’s refinement procedure, the optimization of a combined MD/chemical shift scoring function is sufficient to bias the calculations towards the native state [ 59 ]. In particular, the derivatives of the chemical shift-based terms are not explicitly computed, which would be required for any approaches employing an MD-based integration of Newton’s equations of motion. Apart from exhibiting high accuracy in predicting benchmarked proteins of sizes up to ~14 kDa (Kilodalton) (backbone atom RMSD <1.8 Å) [ 59 ], this approach performed very well (RMSD values <2.6 Å for proteins up to 160 aa (amino acids) in size) in the third round of CASD-NMR (Community Wide Assessment of NMR Structure Determination) evaluation [ 69 , 70 ].

While the CHESHIRE procedure clearly demonstrated that chemical shift-derived fragments can be used to build near-native structures, the same was independently highlighted by the Chemical Shift (CS)-Rosetta approach ( Fig. 3A ) [ 60 ]. CS-Rosetta combines a highly optimized ab initio fragment assembly protocol [ 53 ], employing different sampling schemes and multiple low-resolution energy functions, together with NMR chemical shifts to yield accurate structural models [ 60 ]. This protocol leverages high resolution structures from the PISCES (Public server for culling sets of protein sequences from the PDB by sequence identity) database [ 71 ], corresponding secondary structures assigned by DSSP (Dictionary of Protein Secondary Structure) [ 72 ] and predicted chemical shifts of 13 C α , 13 C β , 13 Cʹ, 15 N, 1 H α and 1 H N nuclei from SPARTA, to generate a library of native-like fragments, as opposed to fragments obtained from sequence information alone ( Fig. 3A ). In the earlier implementation of this protocol, the fragment selection was carried out using the MFR approach [ 62 ], which was later superseded by a modular algorithm that incorporates various experimental data terms and/or other prior biases during the fragment selection process [ 73 ]. The fragments selected using chemical shifts possess ϕ, ψ backbone torsion angles that are closer to their native values, as shown for two representative arginine residues (at positions 45 and 61) in a 72 aa protein ( Fig. 3B and C ). Following three- and nine-residue fragment selection, assembly and refinement are carried out using Rosetta’s Metropolis Monte Carlo procedure. CS-Rosetta makes use of Rosetta’s simplicity and speed during the fragment assembly process. As mentioned previously, a protein chain in Rosetta is represented using torsion angle coordinates. 
By convention, if any torsion angle within a protein chain is perturbed, the angular motion affects all the atoms towards the C-terminus (called the lever-arm effect). To eliminate this effect, a protein chain is depicted using a directed (from N- to C-terminus), acyclic graph called fold tree [ 53 , 74 ]. In a fold tree, the nodes represent residues and the edges represent covalent connections. During the angular motion of torsions, breaks are introduced in a protein chain to eliminate the lever-arm effect. The fold information is preserved using long-range connections, which are also added as edges in the tree [ 53 , 74 ]. The use of fold tree framework greatly simplifies the assembly process by allowing the sampling of non-local structural features while remaining in torsion angle space [ 74 ].
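The lever-arm effect and its removal by a fold tree can be illustrated with a small graph model. The sketch below is a toy data structure written for this purpose and is not Rosetta's FoldTree API: residues are numbered from 1, `cuts` lists residues after which the chain is broken, `jumps` lists long-range connections, and the tree is rooted at the N-terminus. The residue numbers in the example are arbitrary.

```python
from collections import deque


def build_fold_tree(n_residues, cuts=(), jumps=(), root=1):
    """Return {residue: [children]} for a toy fold tree rooted at `root`.

    cuts:  residues k whose bond to residue k + 1 is broken.
    jumps: (a, b) pairs of residues joined by a long-range edge.
    A valid tree needs exactly one jump for every cut.
    """
    if len(cuts) != len(jumps):
        raise ValueError("each chain break must be matched by one jump")
    neighbours = {r: set() for r in range(1, n_residues + 1)}
    for r in range(1, n_residues):
        if r not in cuts:
            neighbours[r].add(r + 1)
            neighbours[r + 1].add(r)
    for a, b in jumps:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Breadth-first traversal orients every edge away from the root.
    children = {r: [] for r in neighbours}
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbours[node]):
            if nxt not in seen:
                seen.add(nxt)
                children[node].append(nxt)
                queue.append(nxt)
    if len(seen) != n_residues:
        raise ValueError("edges do not connect every residue")
    return children


def downstream(children, residue):
    """Residues that move when a torsion at `residue` is perturbed."""
    moved, stack = [], list(children[residue])
    while stack:
        node = stack.pop()
        moved.append(node)
        stack.extend(children[node])
    return sorted(moved)


simple_chain = build_fold_tree(10)
with_jump = build_fold_tree(10, cuts=(5,), jumps=((3, 8),))
moved_simple = downstream(simple_chain, 4)
moved_with_jump = downstream(with_jump, 4)
```

In the unbroken chain a change at residue 4 propagates to every residue up to the C-terminus, whereas with a break after residue 5 and a jump from residue 3 to residue 8 the same change moves only residue 5, while residues 6 to 10 keep their placement relative to residue 3.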

Fig. 3. Chemical shifts aid the selection of high quality backbone fragments in CS-Rosetta. ( A ) C α RMSD to native structure among the top twenty three-residue backbone fragments for each residue position in the sequence of a 72 aa query protein. Fragments are selected based on sequence profile and secondary structure prediction in Rosetta (red). Alternatively, chemical shifts can be used to bias the fragment selection process in CS-Rosetta (blue). ( B and C ) Distribution of ϕ,ψ backbone dihedral angles in the top 100 fragments derived using Rosetta (red) or CS-Rosetta (blue) for two representative Arg residues at positions ( B ) 45 and ( C ) 61. Green dots indicate the ϕ,ψ dihedral angles observed in the native structure (X-ray).

The low-resolution phase employs a simplified representation of a protein chain, where only backbone heavy atoms and a centroid site representing the sidechain are present for every residue. Thereafter, Monte Carlo fragment trial steps are applied, starting from a fully extended protein chain, to yield collapsed backbone folds. In this phase, the low-resolution Rosetta energy function [ 75 – 77 ] is continuously minimized while sampling fragments from the library. In the high resolution phase, sidechain atoms are added to low-resolution structures. The placement of sidechain atoms is challenging due to the exponential number of possible rotamer combinations for all amino acids in a query protein sequence. To solve this problem, Rosetta uses a module called packer [ 53 ]. Packer selects feasible rotamers for each amino acid from the Dunbrack library [ 61 ], and further uses Monte Carlo-simulated annealing method to search for the optimal rotamer combinations. The high resolution (or full-atom) models are further refined using a Monte Carlo- and gradient-based optimization process that performs small backbone perturbations to resolve steric clashes. The final full-atom models are retained based on a high resolution scoring function [ 75 – 77 ], and quality of fit of the predicted models to the experimental chemical shifts. The compliance of the predicted models to experimental shifts is assessed by back-calculating the chemical shifts from the models using SPARTA. The chemical shift deviation in the models is further used to adjust the Rosetta all-atom energy, E, according to Eq. (2) :
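The scale of the sidechain placement problem, and the kind of search used to address it, can be sketched with a toy rotamer model. The code below is an illustration under loud assumptions: rotamers are just integer indices, the energy function is invented (a small self term plus a clash penalty when neighbouring residues pick the same index), and the geometric cooling schedule is a generic choice. It does not reproduce Rosetta's packer or the Dunbrack library.

```python
import math
import random


def count_combinations(rotamers_per_residue):
    """Number of distinct rotamer assignments for the whole chain."""
    return math.prod(rotamers_per_residue)


def toy_packing_energy(state):
    """Hypothetical energy: self term plus a neighbour 'clash' penalty."""
    self_term = 0.1 * sum(state)
    clashes = sum(5.0 for a, b in zip(state, state[1:]) if a == b)
    return self_term + clashes


def anneal_rotamers(rotamers_per_residue, energy, steps=5000,
                    t_start=5.0, t_end=0.05, seed=0):
    """Simulated annealing over discrete rotamer indices.

    Returns the best assignment seen and its energy.
    """
    rng = random.Random(seed)
    state = [0] * len(rotamers_per_residue)
    current = energy(state)
    best_state, best = list(state), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        pos = rng.randrange(len(state))
        old = state[pos]
        state[pos] = rng.randrange(rotamers_per_residue[pos])
        trial = energy(state)
        delta = trial - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
            if current < best:
                best_state, best = list(state), current
        else:
            state[pos] = old  # reject the move and restore the old rotamer
    return best_state, best


sizes = [3] * 10
n_assignments = count_combinations(sizes)
packed, packed_energy = anneal_rotamers(sizes, toy_packing_energy)
```

Even with only three rotamers per residue, ten residues already give 3^10 possible assignments, which is why stochastic search is preferred over enumeration for real sequences.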

Here, i and j index the nuclei and residues, respectively; δ_{i,j}^{pred} is the back-calculated chemical shift value obtained from SPARTA; δ_{i,j}^{exp} is the experimental chemical shift value; σ_{i,j} is a standard deviation; and c is a weight factor, which can be optimized according to benchmark calculations.
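As an illustration of how such a rescoring term might be computed, the following Python sketch adds a weighted sum of squared, σ-normalized shift deviations to a model's energy. The functional form is an assumption chosen to match the symbol definitions given here; the function names, the per-nucleus σ table and the example value of c are hypothetical and do not represent the CS-Rosetta implementation.

```python
from typing import Dict, Tuple

# Key: (nucleus name, residue index), e.g. ("CA", 12). Values are shifts in ppm.
ShiftTable = Dict[Tuple[str, int], float]


def shift_penalty(pred: ShiftTable, exp: ShiftTable,
                  sigma: Dict[str, float]) -> float:
    """Sum of squared deviations between predicted and experimental shifts,
    each normalized by an assumed per-nucleus standard deviation."""
    total = 0.0
    for key, exp_value in exp.items():
        if key not in pred:
            continue  # skip shifts that could not be back-calculated
        nucleus = key[0]
        z = (pred[key] - exp_value) / sigma[nucleus]
        total += z * z
    return total


def rescored_energy(energy: float, pred: ShiftTable, exp: ShiftTable,
                    sigma: Dict[str, float], c: float = 0.5) -> float:
    """Assumed form: E' = E + c * penalty. The default c is a placeholder."""
    return energy + c * shift_penalty(pred, exp, sigma)


if __name__ == "__main__":
    predicted = {("CA", 1): 58.0, ("N", 1): 120.0}
    experimental = {("CA", 1): 57.0, ("N", 1): 122.0}
    sigmas = {"CA": 1.0, "N": 2.0}
    print(rescored_energy(-100.0, predicted, experimental, sigmas))
```

In this sketch a model whose back-calculated shifts agree with experiment keeps its original energy, whereas disagreement raises the energy in proportion to c, so ranking by the adjusted energy favors shift-consistent models.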

The reliability and accuracy of CS-Rosetta were demonstrated through the comparison of predicted models with structures determined experimentally by the Northeast Structural Genomics (NESG) consortium [ 60 ]. The lowest-energy predicted models were remarkably close to the solved X-ray or NMR structures (0.6–2.1 Å backbone RMSD). In this review, we illustrate the effects of chemical shift data in guiding structure modeling within CS-Rosetta for a 15.5 kDa target protein, RTT103 (Regulator of Ty1 Transposition) ( Fig. 4 ) [ 78 , 79 ]. Here, models calculated using chemical shifts are closer to the native structure (PDB ID 2KM4) [ 80 ] and are well-converged, as opposed to the models calculated from sequence and secondary structure prediction alone ( Fig. 4A–C ). The distributions of Rosetta energies in the sampled models further highlight that, through the selection of more native-like fragments, chemical shifts greatly limit the conformational space and bias the search towards a native structure ( Fig. 4D ). When CS-Rosetta was originally tested, it could accurately model structures of proteins of sizes up to 15 kDa. However, the size remained well below the marked solution NMR standard of 25–30 kDa at the time ( Fig. 1 ). The size limitation of CS-Rosetta was subsequently remediated by an improvement in the ab initio fragment sampling protocol and the use of backbone RDC restraints along with chemical shifts (CS-RDC-Rosetta: a β-version of RASREC-Rosetta, discussed below) [ 81 ]. This updated protocol allowed modeling of proteins up to 25 kDa in size.

Improved de novo structure modeling using chemical shifts. Structure calculations for a 15.5 kDa protein (RTT103) are shown with and without the use of chemical shifts. ( A ) Ten lowest-energy structures generated by CS-Rosetta using manually assigned backbone chemical shifts. ( B ) Ten lowest-energy structures generated by CS-Rosetta without the use of chemical shifts. ( C ) Convergence vs backbone RMSD (Å) to native (PDB ID 2KM4), of the ten lowest-energy structures calculated with (green) and without (crimson) the use of chemical shifts. The native structures (PDB ID 2KM4) [ 80 ], which were idealized and refined in the Rosetta force-field for comparison, are additionally provided (purple). ( D ) Rosetta Energy (in R.E.U.) distributions among the 100 lowest-energy structures generated by CS-Rosetta with (green) and without (crimson) the use of chemical shift fragments in ab initio calculations. R.E.U. – Rosetta Energy Units. Data obtained from [ 216 ].

To further address the conformational sampling problem in order to increase the size limit (>25 kDa) of de novo structure determination, the Resolution Adapted Structural Recombination (RASREC)-Rosetta protocol was developed [ 82 ]. RASREC-Rosetta is designed to address difficult targets containing complex topologies with many sequentially distant (or nonlocal) interactions, and makes use of optimization strategies implemented by other protocols [ 74 , 83 – 86 ] alongside important features, including the fold tree framework from Rosetta 3.0 [ 53 ]. Each optimization strategy is customized and embedded in different phases of this multi-staged, iterative approach. Specifically, RASREC-Rosetta has an exploration stage followed by five resampling stages. Every stage retains the best scoring candidate structures that serve as a knowledge base, in conjunction with all available experimental data, for subsequent stages. In stages 1–3, the protein chain represented as a fold tree explores different backbone fragments and long-range β-strand pairings. The possible β-strand arrangements are made available through an annotated library constructed from all β-sheets within high resolution X-ray structures in the PDB. During stages 3–6, the sets of three- and nine-residue fragments being sampled are enriched by segments from the low-resolution structures generated in the earlier stages, to promote resampling of native-like features. Hence, consistently observed structural features that form the core of a protein are retained. These features aid in distinguishing incorrect folds from the correct one in the later stages. Most importantly, if any set of low-resolution candidates does not exhibit core features and seems unfit for full-atom refinement by stage 4, the protocol reverts to earlier stages and restarts from the fold trajectories of successful candidates. After approaching the full-atom refinement stages (stages 5 and 6), fold tree chain breaks introduced during β-strand topology resampling are closed to yield realistic candidate structural models. In addition to the implementation of an array of optimization strategies, RASREC-Rosetta benefits from extensive parallelization through MPI (Message Passing Interface), which allows for batches of structure modeling calculations to be distributed across all available cores in a computer cluster. This greatly enhances the sampling range and overall performance of the method.

RASREC-Rosetta exhibited improved performance relative to conventional CS-Rosetta on a benchmark set of 11 proteins in the 15–25 kDa size range where very sparse amide NOE restraints (of the order of tens) in addition to backbone chemical shifts were provided. Since then, several research groups have shown that it achieves considerable accuracy when applied to larger proteins with complex folds. In particular, RASREC-Rosetta was used together with sparse NOE data recorded for 11 protein targets of sizes up to 40 kDa [ 87 ]. Here, the collection of high quality structural restraints from fully protonated samples poses a challenge due to the slower rotational diffusion rates of larger proteins (>20 kDa), which lead to low signal-to-noise ratios. The authors addressed this drawback by employing a selective methyl labeling scheme (ILV, Isoleucine Leucine Valine) in a perdeuterated background to record methyl-methyl, methyl-amide and amide-amide ¹H NOE contacts at increased sensitivity and resolution [ 87 – 93 ]. These sparse sets of NOE restraints, together with experimental chemical shifts of the backbone atoms, were used by RASREC-Rosetta to model structures at high levels of accuracy (median Cα RMSD <2 Å). In another application, a structural model of the murine cytomegalovirus (CMV) immunoevasin m04, a 23 kDa protein, was determined by RASREC-Rosetta using chemical shifts together with several data sets of complementary NOE and RDC measurements [ 94 ]. By progressively increasing the number of structural restraints supplied to RASREC-Rosetta, well-converged structural models were obtained revealing a novel, complex β-sheet topology similar to the immunoglobulin (Ig) protein fold ( Fig. 5A–C ). Notably, the degree of structural convergence increased by 16% and 22% with an increase in the total number of local restraints provided by a modest set of long-range (i) amide-amide ( Fig. 5A and B ) and (ii) amide-amide, amide-methyl, methyl-methyl ( Fig. 5C ) NOEs recorded using non-uniform sampling methods (NUS) ( Fig. 5D ). In addition to achieving higher levels of convergence, the use of all available restraints (amide-amide, amide-methyl, methyl-methyl NOE contacts and two RDC datasets) allowed modeling of structures that were within 0.6 Å (backbone RMSD) from the X-ray structure (determined independently by a different group and published after the RASREC-Rosetta models), and showed the correct placement of all core sidechains ( Fig. 5E ).

Convergence of a novel fold adopted by the murine cytomegalovirus m04 protein during iterative RASREC-Rosetta structure calculations with various sparse NMR data sets. Ten lowest-energy structural models calculated using NMR chemical shifts together with ( A ) amide-amide NOE restraints, ( B ) amide-amide NOE and RDC restraints, and ( C ) amide-amide, amide-methyl and methyl-methyl NOE together with RDC restraints, show that 73%, 89% and 95% of residues are converged within 3 Å (backbone RMSD), respectively. ( D ) 2D ¹³C–¹³C projection of a 4D methyl HMQC-NOESY-HMQC (Heteronuclear Multiple Quantum Coherence) spectrum recorded without (left) and with (right) non-uniform sampling (NUS). The NUS experiment was recorded with a sparsity of 1.56% and reconstructed using the SMILE algorithm in NMRPipe [ 246 , 247 ]. Both spectra were acquired with similar parameters and the same net acquisition time. ( E ) An overlay of the m04 protein structure determined by solution NMR (cyan, PDB ID 2MIZ) [ 94 ] and X-ray crystallography (green, PDB ID 4PN6) [ 248 ] with backbone heavy atom RMSD of 0.6 Å. This figure is partially (Panels A, B and C) adapted from [ 94 ] with permission.

To illustrate the advantages of fragment-based approaches over torsion angle dynamics methods (the most popular approaches for NMR-based structure determination among all entries in the PDB), we calculated new structures of the Abl kinase RM (Regulatory Module) protein complex using RASREC-Rosetta and compared them with the PDB-deposited models, which were generated using CYANA (Combined Assignment and Dynamics Algorithm for NMR Applications) (PDB ID 6AMW) [ 95 , 96 ]. The CYANA structures were modeled using chemical shift-derived torsion angle restraints together with 3830 short- and long-range NOEs and an additional set (consisting of 80 restraints) of ‘NOE-derived’ hydrogen bond restraints [ 96 ]. In contrast, for the RASREC-Rosetta calculations we used chemical shift-derived torsion angle restraints and a sparse set (1547 long-range) of NOEs (a subset of restraints provided to CYANA; BMRB ID 30332) to guide the structure determination ( Fig. 6 ). Compared with the models produced using CYANA, the ten lowest-energy models ( Fig. 6A ) obtained using RASREC-Rosetta showed improved convergence with respect to the average relative orientation of the two individual domains, as illustrated by structure superpositions performed using either the SH3 domain (residues 83–138) ( Fig. 6B ) (SH: Src Homology) or the SH2 domain (residues 139–237, linker and SH2) ( Fig. 6C ). As highlighted by our results, the use of fragment-based approaches with advanced sampling strategies and a more elaborate high resolution energy function leads to improved convergence of models from a lower restraint density ( Fig. 6D–F ).

Comparison of the Abl kinase regulatory module structural ensembles calculated using RASREC-Rosetta and CYANA. ( A ) Globally aligned ten lowest-energy models of Abl kinase regulatory module (residues 83–237) calculated by RASREC-Rosetta (left) and CYANA (PDB ID 6AMW) (right) using amide-amide, amide-methyl, methyl-methyl NOE and H-bond (for CYANA) restraints. ( B ) Ten lowest-energy models calculated using RASREC-Rosetta (left) and CYANA (right) superimposed with respect to domain A (residues 83–138, SH3 domain). ( C ) Ten lowest-energy models calculated using RASREC-Rosetta (left) and CYANA (right) superimposed with respect to domain B (139–237, connector and SH2 domain). ( D ) Total number of amide-amide, amide-methyl and methyl-methyl NOE restraints used by RASREC-Rosetta and CYANA during iterative structure calculation and refinement. CYANA additionally uses a set of ‘NOE-derived’ hydrogen bonds (H-bond). Amide (orange): amide to amide and amide to methyl NOE contacts. Aliphatic (gray): methyl-methyl NOE contacts. H-bond restraints (red). ( E and F ) Average pairwise backbone heavy-atom RMSDs (in Å) using structural superimpositions performed with respect to different domain selections are shown per residue for structural ensembles calculated using CYANA and RASREC-Rosetta, respectively. Full alignment (blue): global alignment of ten lowest-energy structures. Domain A alignment (crimson): alignment of ten lowest-energy structures with respect to the SH3 domain. Domain B alignment: alignment of ten lowest-energy structures with respect to the connector and SH2 domain. Structural models of Abl kinase RM calculated using RASREC-Rosetta and the corresponding NMR data are available at https://dash.library.ucsc.edu/stash/dataset/doi:10.7291/D1Q94R .

Whereas MFR [ 62 ], CHESHIRE [ 59 ] and CS-Rosetta [ 60 , 82 ] utilize chemical shifts in conjunction with known structures to derive a selection of low-resolution backbone fragments, they do not take full advantage of the high resolution structural information encoded within the data. In these methods, the primary use of experimental chemical shifts is through a comparison against back-calculated (via SPARTA or SHIFTX) chemical shift values used to measure the compliance of selected fragments or models computed using Monte Carlo-based optimization methods. While Monte Carlo-based methods fare very well during structure refinement, they also have a very high rejection rate of trial moves (about 90%) during random exploration while modeling unknown structures. Furthermore, the chemical shift scoring terms computed using SPARTA or SHIFTX are non-differentiable, and therefore, the restraints derived from chemical shifts cannot be used directly to perform a uniform exploration of the conformational (ϕ and ψ) space. To address this bottleneck, several research groups have incorporated chemical shifts directly as differentiable distance restraints [ 44 , 97 , 98 ]. Notably, CamShift applies chemical shift restraints during MD simulations (Chemical Shift restrained Molecular Dynamics (CS-MD)) [ 44 , 97 , 98 ]. To this end, CamShift models NMR chemical shifts as polynomial functions of interatomic distances, deviations from random coil values, dihedral angles, ring current and hydrogen bonding effects. The distance-dependent term of the CamShift objective function is contributed by backbone, sidechain, and through-space atom pair correlations as highlighted by Eq. (3) :

Here, X ∈ {backbone, sidechain, through-space}, distance_{ij} is the distance between atoms i and j, and α_{ij} and β_{ij} are parameters derived from known structures with assigned backbone chemical shifts in the database [ 99 ]. δ_{backbone} captures the distances between a query atom and the backbone atoms of the nearby residues, along with additional distances contributed by the backbone atom pairs of neighbors. δ_{sidechain} captures the distances between the query atom and the sidechain atoms of the same residue. Lastly, δ_{through-space} captures the distances to all atoms within 5 Å of the query atom, excluding the backbone atoms of the query residue and of the neighboring residues that are already included in δ_{backbone}.
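To make the role of the parameters concrete, the following Python sketch evaluates a distance-dependent contribution as a sum of power-law terms, α_{ij} · distance_{ij}^{β_{ij}}, over the partner atoms of one query atom. This form is an assumption based on the description of CamShift as a polynomial function of interatomic distances; the function name, the atom labels and the parameter values are hypothetical and are not taken from the CamShift parameter set.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float, float]


def distance_term(query: Coord,
                  partners: Dict[str, Coord],
                  params: Dict[str, Tuple[float, float]]) -> float:
    """Sum alpha * d**beta over all partner atoms of one query atom.

    partners maps a partner-atom label to its coordinates (in Angstrom);
    params maps the same label to an (alpha, beta) pair.
    """
    total = 0.0
    for label, coord in partners.items():
        alpha, beta = params[label]
        d = math.dist(query, coord)  # Euclidean distance between the two atoms
        total += alpha * d ** beta
    return total


if __name__ == "__main__":
    query_atom = (0.0, 0.0, 0.0)
    neighbours = {"CA": (3.0, 4.0, 0.0), "N": (0.0, 0.0, 2.0)}
    toy_params = {"CA": (2.0, 1.0), "N": (4.0, -1.0)}  # illustrative values only
    print(distance_term(query_atom, neighbours, toy_params))
```

Under this assumed form, the same function could be called once per term (backbone, sidechain, through-space), with each call receiving the partner atoms and parameters that belong to that term.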

In this approach, during every integration step of the MD simulation, an overall potential function is calculated by taking a difference between the CamShift-predicted chemical shifts and the experimental shifts. Here, CamShift computes the forces by directly evaluating the derivatives of the chemical shift potential with respect to the various interatomic distances in Eq. (3) , along the x, y and z coordinates [ 44 ]. Since MD simulations are carried out in the Cartesian coordinate system, the size of the system becomes a limiting factor for larger proteins. Therefore, combining such methods to perform refinement after the generation of an initial set of starting models computed quickly using existing fragment assembly approaches (such as CHESHIRE and CS-Rosetta) provides a promising avenue towards modeling larger, more complex protein folds [ 97 ].
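The reason differentiability matters can be shown with a short Python sketch: if the predicted shift is a smooth function of interatomic distances, the force on an atom follows from the chain rule. The sketch assumes a harmonic penalty k(δ_pred − δ_exp)² and a power-law distance dependence; both forms, as well as all names and numbers, are illustrative assumptions and not the CamShift implementation.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]
Term = Tuple[Vec, float, float]  # (partner coordinates, alpha, beta)


def predicted_shift(q: Vec, terms: List[Term]) -> float:
    """Assumed shift model: sum of alpha * d**beta over partner atoms."""
    return sum(a * math.dist(q, p) ** b for p, a, b in terms)


def shift_energy(q: Vec, terms: List[Term], exp_shift: float,
                 k: float = 1.0) -> float:
    """Assumed harmonic penalty on the predicted-minus-experimental shift."""
    return k * (predicted_shift(q, terms) - exp_shift) ** 2


def shift_force(q: Vec, terms: List[Term], exp_shift: float,
                k: float = 1.0) -> Vec:
    """Force on the query atom: minus the analytical gradient of shift_energy."""
    diff = predicted_shift(q, terms) - exp_shift
    grad = [0.0, 0.0, 0.0]
    for p, a, b in terms:
        d = math.dist(q, p)
        dshift_dd = a * b * d ** (b - 1.0)  # derivative of alpha * d**beta
        for axis in range(3):
            # Chain rule: dE/dx = 2k * diff * d(shift)/dd * dd/dx
            grad[axis] += 2.0 * k * diff * dshift_dd * (q[axis] - p[axis]) / d
    return (-grad[0], -grad[1], -grad[2])


if __name__ == "__main__":
    atom = (0.5, -0.2, 0.1)
    toy_terms = [((3.0, 4.0, 0.0), 2.0, 1.0), ((0.0, 0.0, 2.0), 4.0, -1.0)]
    print(shift_force(atom, toy_terms, exp_shift=10.0))
```

A finite-difference comparison against shift_energy is a simple way to check such analytical forces before using them in an integrator.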

5. Applications of chemical shifts in homology-based modeling

As stated by Anfinsen’s postulate, the sequence of amino acids in a protein contains sufficient information to determine its fold [ 100 ]. By extension, two or more evolutionarily related (or homologous) proteins that share considerable amino acid sequence similarity likely also have comparable 3D structural features. Classical methods, including MODELLER [ 101 ], I-TASSER [ 55 ] and threading protocols in Rosetta [ 102 ], have achieved success in performing homology (also referred to as comparative or template-based) modeling even when the similarity between the query and template sequences is low (up to 20–25%) [ 103 ]. Nonetheless, these methods generally require a high (>40%) degree of sequence similarity to the template to obtain reliable models for larger proteins. Alternative approaches combine information drawn from evolutionarily related proteins (or templates) with sparse experimental data to overcome the drawbacks of classical methods. In particular, NMR chemical shifts can supplement sequence information to guide template identification and alignment at lower sequence similarity levels, thereby helping to alleviate a problem that has plagued comparative modeling since its inception [ 104 ]. There are now robust approaches that have employed this concept, each unique in the way it selects templates and extracts restraints in order to model the structures of larger proteins with high accuracy.

The first method which combined comparative modeling algorithms with backbone and 13 C β chemical shifts to derive consistently accurate models of protein structures is CS-HM-Rosetta (Chemical Shift-Homology Modeling-Rosetta) [ 104 ]. In this approach, classical CS-Rosetta ab initio calculations are used together with evolutionary distance restraints [ 105 ] derived from homologous proteins (or templates with ~30% sequence identity) in the PDB to bias the search towards solutions that are consistent with both (i) the fold in template structure(s) and (ii) the backbone chemical shifts ( Fig. 7 , blue). In this way, the chemical shift data are used as a means to distinguish high quality alignments from incorrect alignments, both locally and globally. Relative to conventional comparative modeling protocols, this enables accurate template-based modeling in spite of low sequence similarity levels.

Methods for structure determination using restraints derived from evolutionary information and NMR chemical shifts. (Blue) Flow diagram of CS-HM-Rosetta [ 104 ]. In CS-HM-Rosetta, the query sequence is aligned to the sequences of template (evolutionarily related) protein structures extracted from the PDB. A set of distance restraints from the template structures are derived using Gaussian probability densities (silver) for every pair of Cα atoms in the sequence of the query protein. These distance restraints are used along with sparse NMR data (NOEs, RDCs), for chemical shift fragment-based structure determination by CS-Rosetta. (Gray) Flow diagram of CS-RosettaCM/Pomona [ 109 ]. CS-RosettaCM/Pomona obtains chemical shift-derived torsion angles for a query protein using the backbone chemical shifts, and then performs pairwise alignment to the torsion angles and sequence of template structures in the PDB using a dynamic programming algorithm. Following pairwise alignment, possible template structures are selected and clustered. The representative templates are then selected from these clusters. Together with sparse NMR restraints, the filtered structures serve as templates for the CS-RosettaCM protocol. (Dark red) Flow diagram of EC-NMR [ 128 ] and RASREC-Rosetta with evolutionary restraints [ 127 ]. Here, a multiple sequence alignment is constructed using the query sequence and many template sequences with unknown structures. Spatially related residues that exhibit covariance in the sequence alignment are then identified using statistical algorithms to derive structural restraints, termed evolutionary coupling (EC) restraints. These EC restraints are combined with NMR data (such as chemical shifts, NOEs, RDCs) and input to standard RASREC-Rosetta or CYANA for structure calculation.

In order to derive long-range structural restraints, CS-HM-Rosetta uses a probabilistic approach to relate a diverse input of pairwise alignments to template sequences with features in the corresponding template structures. Here, (i) all proteins in the PDB are aligned to the query sequence using HHSearch (HMM-HMM search; HMM: Hidden Markov Model) [ 106 ], where the alignment criterion is based on the predicted secondary structure of the query sequence versus the template secondary structure (via DSSP [ 72 ]); the alignment pairs with lower e-values are retained; (ii) every pair of residues that are ten or more positions apart along the query sequence is considered; if the distance between the Cα atoms of the corresponding residues in the template structure is within 10 Å, then it is used to compute a multi-basin Cα distance constraint; (iii) the joint distribution of distances obtained from all alignments is analyzed against a set of four alignment quality features, including the HHSearch e-value (local sequence similarity) [ 106 ], the BLOSUM62 (Block Substitution Matrix) score [ 107 ] for the aligned residue pairs (global sequence similarity), the nearest gap in a query sequence, and the number of Cβ atoms within 8 Å from other Cβ atoms in the template structure (buried surface). Finally, a multi-modal distribution of distance deviations is constructed for every Cα atom pair in the sequence, and subsequently converted into a single distance restraint. In this way, the confidence of each restraint is strengthened by combining distances computed for the same residue pair from multiple template structures, represented as a mixture of Gaussian distributions [ 105 ] ( Fig. 7 , silver).
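The idea of pooling distances from several templates into one restraint can be sketched in Python as a weighted mixture of Gaussians whose negative logarithm serves as a penalty. The weights, means and widths in the example are hypothetical; in the protocol described here they would depend on the alignment quality features, and the scoring function shown is an illustrative assumption rather than the CS-HM-Rosetta implementation.

```python
import math
from typing import List, Tuple

# One component per template alignment: (weight, mean distance, sigma), in Angstrom.
Component = Tuple[float, float, float]


def mixture_density(d: float, components: List[Component]) -> float:
    """Probability density of a Calpha-Calpha distance d under the mixture."""
    total_weight = sum(w for w, _, _ in components)
    density = 0.0
    for w, mean, sigma in components:
        z = (d - mean) / sigma
        gauss = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        density += (w / total_weight) * gauss
    return density


def restraint_score(d: float, components: List[Component],
                    floor: float = 1e-300) -> float:
    """Negative log-density; lower is better. The floor avoids log(0)."""
    return -math.log(max(mixture_density(d, components), floor))


if __name__ == "__main__":
    # Two templates suggest about 6.0 and 6.5 Angstrom for the same residue pair.
    templates = [(0.7, 6.0, 1.0), (0.3, 6.5, 1.5)]
    for dist in (6.0, 9.0):
        print(dist, round(restraint_score(dist, templates), 3))
```

Because components from different templates overlap when they agree, the combined density becomes sharper around the consensus distance, which is one way to read the statement that agreement among templates strengthens a restraint.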

The distance restraints drawn from evolutionarily related proteins directly influence the convergence and distribution of sampled Rosetta energies in explicit CS-HM-Rosetta calculations. If the input alignments are incorrect either locally or globally, then the derived distance restraints will not be consistent with the experimental chemical shifts, and will yield models with poor convergence and high energies relative to control calculations performed without the use of evolutionary distance restraints [ 108 ]. CS-HM-Rosetta’s ability to model accurate backbone structures (RMSD < 2 Å) and recover a high degree (75–85%) of native sidechain rotamers demonstrates that the combination of NMR chemical shifts with evolutionary distance restraints can circumvent the need to analyze NOE data for targets with remote homologs in the PDB. Instead, the NOE data (if available) can be used for structure validation.

As an alternative to conventional sequence and predicted secondary structure-based alignment methods, the CS-RosettaCM/Pomona (Chemical shift-Rosetta Comparative Modeling/Protein Alignments Obtained by Matching of NMR Assignments) [ 109 ] ( Fig. 7 , gray) protocol relies on the idea that NMR chemical shifts encode local structural homology. The key innovation in this procedure is a protein alignment module, Pomona, which uses TALOS-N [ 110 ] to estimate ϕ/ψ backbone torsion angle probability maps from ¹³Cα, ¹³Cβ, ¹³Cγ, ¹⁵N, ¹Hα, and ¹HN chemical shifts for every amino acid in the query sequence. These maps are used to compute a substitution score measuring local similarity between the query and a template structure, given by the weighted contributions of backbone torsions, secondary structure and sequence similarity. A pairwise sequence alignment is then performed using a modified version of the Smith-Waterman dynamic programming algorithm [ 111 ], with an objective function that optimizes the substitution score augmented by a gap insertion penalty term [ 112 ]. The resulting alignment is further validated according to the consistency between experimental chemical shifts and SPARTA+ computed chemical shifts (for each residue in the query sequence that aligns to a residue in the database used by SPARTA+). In contrast to classical, sequence-based comparative modeling methods, homologous proteins with sequence identity ≥20% are excluded for the examples used in that study to prevent overfitting. All template structures identified by Pomona undergo normalized Cα RMSD-based hierarchical clustering, and the ten top-ranking clusters with respect to alignment score are retained. Finally, the top two representatives from each cluster are used as structural templates for Rosetta’s comparative modeling protocol, RosettaCM [ 113 , 114 ].
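The alignment step can be illustrated with a plain Smith-Waterman local alignment in Python that accepts an arbitrary substitution score and a linear gap penalty. This is the textbook algorithm, not Pomona's modified version; the toy_score function, which rewards matching residue type and secondary-structure class, is a hypothetical stand-in for the torsion-angle-based score described here.

```python
from typing import Callable, List, Sequence, Tuple

Residue = Tuple[str, str]  # (amino acid, secondary-structure class)


def toy_score(a: Residue, b: Residue) -> float:
    """Hypothetical substitution score combining sequence and structure class."""
    aa = 2.0 if a[0] == b[0] else -1.0
    ss = 1.0 if a[1] == b[1] else -1.0
    return aa + ss


def smith_waterman(query: Sequence[Residue], template: Sequence[Residue],
                   score: Callable[[Residue, Residue], float],
                   gap: float = 2.0) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the best local alignment score and the aligned index pairs."""
    n, m = len(query), len(template)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + score(query[i - 1], template[j - 1])
            H[i][j] = max(0.0, diag, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best-scoring cell until a zero cell is reached.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if H[i][j] == H[i - 1][j - 1] + score(query[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1  # gap in the template
        else:
            j -= 1  # gap in the query
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    q = [("T", "C"), ("T", "C"), ("A", "H"), ("C", "H"), ("D", "H"), ("E", "H")]
    t = [("A", "H"), ("C", "H"), ("D", "H"), ("E", "H"), ("G", "C"), ("G", "C")]
    print(smith_waterman(q, t, toy_score))
```

Passing the score as a function keeps the dynamic programming independent of how similarity is defined, so a chemical shift-derived score could be substituted without changing the alignment code.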

More recently, the use of sequence covariance information to infer structural relationships between different pairs of residues along the query sequence has shown great promise for enabling reliable fold identification [ 115 ]. Stemming from the principle that evolutionary coupling correlates well with structural proximity, a growing body of work combines evolutionary data with sparse experimental restraints towards accurate modeling of protein structures [ 115 – 117 ]. Moreover, a global effort towards inferring a reliable network of EC (evolutionary coupling) restraints from fewer homologous sequences has improved the effectiveness of this approach [ 115 , 116 , 118 – 124 ]. These methods typically rely on global statistical methods, such as pseudo-likelihood maximization (PLM) [ 122 ] and/or direct information (DI) [ 125 ], and more recently deep learning methods [ 126 ] to identify relevant sequence features for robust identification of residue contacts. As a general rule, these methods first perform a HMM profile-based multiple sequence alignment (MSA) of the evolutionarily related protein sequences. Following MSA, a covariance matrix between all pairs of residues in the query sequence is created. The inverse of the covariance matrix provides conditional mutual information, which allows estimation of residue-residue contacts. Even though these methods exhibit high accuracy in predicting true structural contacts (>80% true positive rate among the top 50 predictions [ 122 , 125 ]), they also have a high false positive rate. Therefore, the extent to which such heterogeneous sets of restraints can be used to guide protein modeling calculations depends on the use of advanced sampling protocols, along with experimental data which can in principle distinguish correct from incorrect EC restraints on the basis of the calculation outcome.
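The benefit of inverting the covariance matrix can be seen in a small Python example. For three positions in which A couples to B and B couples to C, the covariance between A and C is non-zero even though they are not in direct contact, whereas the corresponding entry of the inverse vanishes. The sketch uses a toy 3×3 covariance matrix of continuous variables and a plain Gauss-Jordan inversion; real EC methods work on encoded alignment columns with regularization, so the matrix and function names are assumptions for illustration only.

```python
import math
from typing import List

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def direct_couplings(cov: Matrix) -> Matrix:
    """Partial correlations from the inverse covariance (precision) matrix."""
    prec = invert(cov)
    n = len(prec)
    return [[0.0 if i == j else
             -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Chain A-B-C: A and C covary (0.25) only through B.
    cov = [[1.0, 0.5, 0.25],
           [0.5, 1.0, 0.5],
           [0.25, 0.5, 1.0]]
    for row in direct_couplings(cov):
        print([round(v, 3) for v in row])
```

In this toy case the A–B and B–C couplings survive while the A–C coupling drops to zero, which mirrors how inversion helps separate direct contacts from transitive correlations.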

The incorporation of EC restraints together with NMR chemical shifts within robust sampling protocols shows great promise towards identifying the native folds of larger proteins. As described earlier, RASREC-Rosetta has a high degree of accuracy and precision in modeling protein structures with complex folds, in the face of sparse experimental data and erroneous restraints. More recently, RASREC-Rosetta was extended to employ evolutionary contacts in addition to NMR chemical shifts and available sparse experimental data ( Fig. 7 , dark red) [ 127 ]. In this approach, restraints from evolutionary couplings are obtained using either the PLM or DI scoring methods in EVFold (Evolutionary Fold) [ 115 , 116 ]. NMR chemical shifts complement the EC restraints, by identifying a consistent network of restraints during RASREC-Rosetta calculations, and thus eliminating any structurally unrelated correlations recognized by EVFold. In addition, the energy function in RASREC-Rosetta is further adjusted to account for incorrectly drawn EC restraints [ 127 ].

An alternative, more integrative approach, EC-NMR (Evolutionary Coupling-NMR spectroscopy), combines evolutionary contact information with NMR data within the structure determination program CYANA ( Fig. 7 , dark red) [ 128 ]. In this approach, EC restraints are inferred from the analysis of MSAs using the jackhmmer algorithm [ 129 ]. NMR data, including backbone and sidechain chemical shifts, NOESY peak lists and RDCs are recorded for ILV-methyl labeled protein samples [ 130 – 132 ]. Briefly, the EC restraints are combined with the previously assigned backbone and sidechain NMR chemical shifts, and used to assign the NOESY cross-peaks using the ASDP program [ 133 ]; these assigned restraints are then used in the full simulated annealing structure determination protocol in CYANA [ 95 ]. Here, the correct EC restraints and unambiguous NOESY assignments form a reliable network of contacts which helps in resolving ambiguities in the remaining NOESY assignments and in eliminating possible false positive EC restraints. Finally, the full set of assigned NMR restraints and evolutionary couplings are used to refine the preliminary CYANA models using Rosetta’s all-atom energy function [ 134 ].

6. Modeling protein complexes using chemical shifts

Protein complexes constitute over 50% of the proteome and participate in many important biological processes [ 135 ]. NMR has enabled structural studies of such systems in vitro [ 136 ]; however, full structure elucidation is challenging due to their large size and the presence of dynamics, whose effects become more pronounced at the interface between different subunits and ultimately lead to exchange-induced line broadening of the NMR resonances [ 137 ]. An existing method, HADDOCK (High Ambiguity Driven Docking) (reviewed in [ 138 ]), makes use of chemical shift perturbations to model the structures of protein complexes [ 139 – 142 ]. Specifically, the differences in backbone and sidechain chemical shifts upon complex formation are used to derive ambiguous distance restraints [ 143 , 144 ], which further guide the docking of monomeric subunits under the assumption that the changes are localized to the binding surface(s). Similar to other semi-flexible docking methods, a major challenge remains in addressing any conformational changes that occur upon complex formation, for instance in domain-swapped protein assemblies and systems with more complex topologies.

RosettaOligomers leverages chemical shift fragments within Rosetta’s docking protocols to perform de novo modeling of symmetric oligomers [ 137 ]. This approach relies on CS-Rosetta to generate structures of monomeric subunits from sequence information, NMR chemical shifts and sparse NOE restraints (if available). In one branch of the protocol, the oligomers being modeled are assumed to contain relatively simple interfaces in which the monomers do not entwine significantly, and therefore the predicted subunits can be used in their free states. All models in the low-energy ensemble computed for the monomeric subunits are then docked together using sparse RDC restraints and user-defined symmetry information [ 145 ] ( Fig. 8 , Pathway 1). Another branch of this protocol makes use of more elaborate (and computationally demanding) fold and dock calculations to address cases of domain-swapped, or interleaved, oligomeric proteins [ 146 – 149 ]. These cases can be diagnosed on the basis of the initial CS-Rosetta calculations performed for the monomeric subunits: if the resulting structural models exhibit divergence (>3 Å) after ab initio folding, then the oligomeric complex is likely to be interleaved ( Fig. 8 , Pathway 2). Although this method was developed originally to model symmetric domains, it can be extended to accommodate asymmetric domains [ 149 , 150 ]. RosettaOligomers was recently integrated with RASREC-Rosetta. This extended protocol uses NMR chemical shifts, RDCs and SAXS (Small-angle X-ray scattering) data to model larger complexes, as was demonstrated for a 33 kDa dimer target [ 151 ].

Structure determination of protein complexes with RosettaOligomers guided by chemical shifts. Flow diagram for modeling protein complexes with RosettaOligomers [ 137 ] using chemical shifts, sparse NMR restraints derived from NOEs, RDCs, SAXS data sets, and user-specified symmetry definition. Pathway 1, designed to address oligomers from independently folded monomers (PDB ID 1C77, left): (i) CS-Rosetta produces a structural ensemble for each monomeric subunit and (ii) monomers are docked using protein–protein docking protocols in Rosetta [ 145 ]. Pathway 2, designed to address domain-swapped oligomers (PDB ID 2K5J, right): (i) chemical shift-derived backbone fragments, together with sparse NMR restraints, are used in one step with the fold and dock protocol [ 149 ]. Both approaches can be used in either a fully symmetric mode, using Rosetta’s symmetry interface [ 249 ], or an asymmetric mode.

CamDock performs ab initio modeling of protein complexes using the Chord program [ 152 ] which is based on the HEX [ 153 ] approach in its use of a spherical harmonics-based representation of protein surfaces. Here, backbone chemical shifts are used together with CHESHIRE’s molecular dynamics refinement strategy, as described in Section 4. CamDock was used to model E9-Im9, a 60 kDa protein complex, which resulted in a structural ensemble that is very close (1.18 Å C α RMSD) to the reference X-ray structure [ 152 ]. In a more recent work focusing on a Ztaq: Anti-ZTaq protein complex containing 144 amino acids in total [ 154 ], the CHESHIRE procedure was first applied to model the monomeric subunits in their bound states, which were then docked as rigid bodies using a protocol akin to CamDock. The docked protein complex is further optimized by CHESHIRE’s hybrid (MD/Monte Carlo) refinement protocol using an objective function that captures experimental and predicted chemical shifts together with molecular mechanical force fields (see Section 4). As a result of these key innovations, the combined CHESHIRE/CamDock approach generated structural models within 1 Å (backbone RMSD) from the reference X-ray structure [ 154 ].

7. Modeling transient, sparsely populated conformations from chemical shifts

Chemical shifts also constitute unique NMR observables in modeling the structures of biologically relevant, sparsely populated transient protein and nucleic acid conformations (termed ‘dark’, ‘invisible’ or ‘excited’ states) [ 155 , 156 ]. Such measurements are made possible by the development of a suite of NMR experiments to probe excited states with lifetimes in the µs–ms (microsecond–millisecond) timescale. In particular, PRE/PCS measurements [ 157 ] are useful for cases of fast conformational exchange; rotating frame R 1ρ relaxation [ 158 ] and Carr–Purcell–Meiboom–Gill (CPMG) dispersion [ 159 ] for intermediate exchange; and chemical-exchange saturation transfer (CEST) [ 160 ] for slow exchange. The power of these methods is highlighted in key applications for the FF domain of human HYPA/FBP11 [ 161 ], Fyn SH3 domain [ 162 ], T4 lysozyme [ 163 ], HIV (human immunodeficiency virus)-1 transactivation response element RNA (ribonucleic acid) [ 164 ], ubiquitin [ 165 ], the Ca 2+ sensor signaling protein calmodulin [ 166 ], a transcriptional riboswitch [ 167 ], and the E. coli enzyme dihydrofolate reductase [ 168 ]. These examples assume a system in which conformational exchange occurs between two states. Whereas multi-state exchange models have been explored by several research groups, they are usually limited to three states to avoid overfitting of the NMR data [ 156 ].

Recently, the integration of chemical shifts derived from the fitting of relaxation dispersion data with methods such as CS-Rosetta has enabled modeling of sparsely populated protein conformations. In these studies, typically, a series of CPMG dispersion experiments recorded at multiple magnetic fields and temperatures provides insights into the excited states by fitting populations, chemical shift differences ( Δω ), and exchange rates ( k ex ) for the major (ground-state) and minor (excited-state) conformations [ 169 ]. CS-Rosetta has been employed together with backbone 1 H, 15 N, 13 C chemical shifts and amide RDCs to model the excited-state conformations of folding intermediates of either (i) a T4 lysozyme mutant [ 163 ] or (ii) a HYPA/FBP11 FF domain [ 161 , 170 ]. Alternatively, excited-state structures can be elucidated using paramagnetic NMR restraints provided by PRE and PCS (pseudocontact shift) measurements [ 156 ]. Specifically, PCS restraints have been used for structure determination of a transient thioester intermediate formed between Staphylococcus aureus sortase A (SrtA) and a substrate peptide, which was inaccessible to traditional structure determination methods due to its short lifetime [ 171 ]. In that study, structural restraints for the SrtA-peptide intermediate were acquired by labeling SrtA with paramagnetic lanthanide tags, which enabled the detection of 407 PCS restraints used for structure calculation in Xplor-NIH [ 172 ].
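To make the fitted quantities concrete, the sketch below evaluates a two-state CPMG dispersion curve from a minor-state population, Δω and k ex. It uses the fast-exchange (Luz–Meiboom) approximation as a stand-in; the studies cited above may well use the more general two-state expressions, so this is an illustrative simplification, and the function and parameter names are hypothetical.

```python
import numpy as np

def r2eff_fast_exchange(nu_cpmg, r20, p_minor, delta_omega, k_ex):
    """Effective transverse relaxation rate (s^-1) for two-state fast exchange.

    nu_cpmg     : CPMG pulsing frequency in Hz (scalar or array)
    r20         : exchange-free transverse relaxation rate in s^-1
    p_minor     : population of the minor (excited) state, between 0 and 1
    delta_omega : chemical shift difference between the states in rad/s
    k_ex        : exchange rate in s^-1 (assumed much larger than delta_omega)
    """
    nu = np.asarray(nu_cpmg, dtype=float)
    phi = (1.0 - p_minor) * p_minor * delta_omega ** 2
    x = k_ex / (4.0 * nu)
    # The exchange contribution is quenched as the pulsing frequency increases.
    return r20 + (phi / k_ex) * (1.0 - np.tanh(x) / x)
```

In this approximation the populations and Δω enter only through the product p A p B Δω², which is one reason dispersion data recorded at several magnetic fields are needed to separate them.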

8. Modeling the conformations of sidechains from chemical shift data

Despite undeniable progress in chemical-shift-driven structure determination, the majority of studies have focused on information extracted from backbone chemical shifts, often resulting in lower resolution with respect to the orientation of sidechain groups. High resolution modeling of sidechain conformations is of considerable interest towards understanding protein function with respect to enzyme catalysis [ 173 ], protein interaction modes [ 174 ] and folding [ 175 ]. In the absence of sidechain chemical shift measurements, the most probable orientations of sidechains can be inferred from their lowest-energy conformations sampled from existing rotamer libraries [ 176 – 180 ]. Many ab initio structure determination approaches utilize such rotamer libraries to model static sidechain conformations using Monte Carlo-based optimization. Similar to protein backbone modeling using experimental and predicted chemical shifts, modeling sidechain conformations can be significantly improved by the use of 13 C chemical shifts. Towards this end, chemical shift prediction methods, such as CH3SHIFT [ 181 ], can help guide the rotamer selection and structure refinement processes. In practice, the utility of such methods is limited by the difficulty of predicting the γ-gauche effect, where the 13 C chemical shift of a given nucleus is influenced by its position relative to γ -substituents [ 182 ], and by the observation that sidechain conformations may be constrained in X-ray structures relative to a solution environment.

In solution, sidechain rotamers sample an ensemble of functionally relevant states, which can be unveiled, in principle, by a full analysis of NMR chemical shifts. Here, methyl groups, typically found at the hydrophobic core of proteins, have favorable relaxation properties and their resonances are useful when studying larger, more complex systems [ 174 ]. While stereospecific characterization of methyl groups is generally difficult to achieve using uniformly labeled samples, the use of stereospecific isotopic methyl labeling schemes (employing precursors that lead to pro-R and pro-S labeled leucine or valine residues [ 183 ]) can help distinguish these groups, even for larger targets. This can in turn aid in capturing different rotamer configurations for leucines and valines [ 184 ]. For example, determination of sidechain rotameric states for leucine Cδ1/Cδ2 groups, which can sample trans, gauche + or gauche − conformations, can be performed using measurements of chemical shift differences between stereospecifically assigned methyl groups (ΔCδ12 = δ ( 13 C δ1 ) – δ ( 13 C δ2 )) [ 185 ] or empirical 3 J CC ( 13 CH 3 – C α ) scalar bond couplings [ 186 , 187 ]. The former was demonstrated through a clear correlation between 13 C sidechain chemical shifts, χ 1 /χ 2 dihedral angles and rotamer conformations observed in high resolution structures [ 185 ]. In addition, a linear combination of ΔCδ12 and empirical 3 J CC scalar coupling values proved useful for the interpretation of more dynamic leucine rotamer populations for calbindin D9k [ 187 ]. This analysis was facilitated by the fact that the leucine χ 2 dihedral angle primarily samples trans and gauche + conformations in solution [ 188 ]. Simultaneously, isoleucine sidechain χ 2 rotamer conformations can be determined from chemical shifts [ 189 ] or J-coupling [ 190 ] measurements. 
Although isoleucine χ 2 rotamers can sample all four (trans, gauche + , gauche − , gauche 100 ) distinct conformations, based on analysis of high resolution X-ray structures, only the trans and gauche − conformers are populated in solution [ 189 ]. A similar approach has been applied for elucidating the χ 3 rotamer of the methionine Cε methyl group [ 191 ]. The situation is more complicated for valine because its sidechain χ 1 can sample multiple rotamer states (trans, gauche + or gauche − ) in solution. Here, each valine χ 1 rotamer is derived from fitting measured 13 C γ1 / 13 C γ2 chemical shifts to a set of 20 χ 1 dihedral angles, allowing for accurate estimation of trans, gauche + and gauche − rotamer populations [ 192 ].
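When only two rotamers are populated and interconvert rapidly, as described above for the leucine χ 2 angle, the observed shift difference is a population-weighted average of the values for the pure states. The sketch below inverts that relation. It is a generic two-state illustration rather than the calibrated relationship of the cited work: the endpoint values `delta_trans` and `delta_gauche` must be supplied by the user, and the function name is hypothetical.

```python
def trans_population(delta_obs, delta_trans, delta_gauche):
    """Trans population from a fast-exchange, two-state average.

    delta_obs    : measured shift difference (ppm), e.g. delta(Cd1) - delta(Cd2)
    delta_trans  : value expected for a pure trans rotamer (ppm, user supplied)
    delta_gauche : value expected for a pure gauche+ rotamer (ppm, user supplied)
    """
    if delta_trans == delta_gauche:
        raise ValueError("endpoint values must differ")
    # delta_obs = p * delta_trans + (1 - p) * delta_gauche, solved for p
    p = (delta_obs - delta_gauche) / (delta_trans - delta_gauche)
    # Clip values pushed outside the physical range by experimental noise.
    return min(1.0, max(0.0, p))
```

With purely illustrative endpoints of +5 and −5 ppm, an observed difference of 0 ppm would correspond to equal populations of the two rotamers.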

In practice, these approaches have been applied to accurately determine methyl sidechain conformations in sparsely populated excited-states through the measurement of chemical shifts via CPMG relaxation dispersion experiments [ 188 , 189 ]. Measurement of methyl 13 C chemical shifts has also shown success in solid-state NMR studies [ 193 ]. Finally, the focus of sidechain modeling has primarily assumed a single rotameric state rather than a distribution of states. A major step towards addressing this problem has been the implementation of a curated database consisting of extensive dynamic sidechain rotamers sampled from the MD simulations of known protein structures [ 194 ]. However, incorporating the information content of dynamic sidechains requires computing long MD simulation trajectories, which can be limiting for larger systems.

9. Modeling nucleic acids using chemical shifts

While the use of chemical shifts in structural studies of nucleic acids is a fairly mature field (reviewed in [ 194 ]), the application of shifts towards full structure determination is relatively new [ 196 ]. Unlike for proteins, the relation between NMR chemical shifts and the corresponding nucleic acid structures is difficult to discern because of the narrow dispersion range of shift values and the limited availability of assigned chemical shifts in the BMRB, both of which have stymied the development of automated methods [ 195 , 197 ]. Nevertheless, 1 H chemical shifts still provide powerful restraints that can distinguish native from non-native nucleic acid conformations [ 198 ]. Furthermore, approaches have been developed to assign [ 199 ] or predict [ 200 – 203 ] chemical shifts to aid nucleic acid structure determination, refinement and validation.

De novo modeling of RNA structures containing non-canonical regions is challenging, and several research groups have tackled this problem with limited success [ 204 – 207 ]. More recently, two complementary approaches, FARFAR (Fragment Assembly of RNA with Full Atom Refinement) [ 208 ] and SWA (Stepwise Assembly) [ 209 ], have exhibited favorable outcomes in modeling non-canonical regions using realistic force fields. In particular, FARFAR employs a fragment assembly approach [ 204 ] to model low resolution structures, followed by full-atom refinement using a high resolution energy function. Alternatively, SWA builds each nucleotide in a stepwise, recursive manner, where each step involves exhaustive enumeration of all possible conformations of the new residue. Although both these methods address conformational sampling bottlenecks that typically arise during RNA structure modeling, for more complicated cases (such as the UUAAGU hexaloop from 16S ribosomal RNA), the energy function does not provide sufficient discrimination of the native state [ 210 ]. These results spurred the development of CS-Rosetta-RNA (Chemical Shift-Rosetta-Ribonucleic Acid), where 1 H chemical shifts are exploited to perform de novo modeling of RNA structures ( Fig. 9 ) [ 210 ]. In this method, FARFAR and SWA are used in parallel to sample a large number of plausible RNA conformations. The resulting RNA structural models are energy minimized and ranked using an adjusted all-atom energy function according to Eq. (4) :

\[
E = E_{\mathrm{Rosetta}} + c \sum_i \left( \delta_i^{\mathrm{exp}} - \delta_i^{\mathrm{calc}} \right)^2 \qquad (4)
\]

Here, $E_{\mathrm{Rosetta}}$ is the standard all-atom energy function in Rosetta [ 208 ] without using chemical shifts; $\delta_i^{\mathrm{exp}}$ and $\delta_i^{\mathrm{calc}}$ are the experimental and back-calculated non-exchangeable proton chemical shifts, the latter obtained using NUCHEMICS [ 200 , 202 ]; and $c$ is a weight factor. Results on a benchmark set of 23 targets show that introducing chemical shift-based terms into the high resolution potential drives the simulations towards a global, native-like minimum in the energy landscape. These approaches hold great promise for the modeling of protein/RNA complexes in future methods.
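The effect of the chemical shift term on model ranking can be illustrated with a short sketch. It assumes the adjusted energy is the Rosetta energy plus a weighted sum of squared differences between experimental and back-calculated shifts; the function names, the value of the weight and the toy numbers in the usage note are hypothetical and are not taken from CS-Rosetta-RNA.

```python
import numpy as np

def shift_augmented_energy(e_rosetta, shifts_exp, shifts_calc, c=1.0):
    """Rosetta energy plus an assumed quadratic chemical shift penalty."""
    diff = np.asarray(shifts_exp, dtype=float) - np.asarray(shifts_calc, dtype=float)
    return float(e_rosetta + c * np.sum(diff ** 2))

def rank_models(models, shifts_exp, c=1.0):
    """Sort models by adjusted energy, lowest first.

    models: iterable of (name, e_rosetta, shifts_calc) tuples.
    """
    scored = [(shift_augmented_energy(e, shifts_exp, calc, c), name)
              for name, e, calc in models]
    return [name for _, name in sorted(scored)]
```

For example, with experimental shifts `[7.5, 8.1, 5.9]`, a model with a Rosetta energy of −100 whose back-calculated shifts are each off by 0.5 ppm is overtaken by a model at −98 with near-perfect shifts once `c` is set to 10, whereas with `c = 0` the original order is retained.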

Fig. 9.

Modeling RNA structures using CS-Rosetta-RNA. The query RNA sequence is used by specialized Rosetta protocols, FARFAR [ 208 ] and SWA [ 209 ] to construct a large number of plausible RNA structures. The predicted RNA structures are filtered using a combination of standard Rosetta energy function terms together with a penalty function which measures the difference between experimental and back-calculated chemical shift values ( Eq. (4) ).

10. Chemical shift-based structure determination and iterative NOE assignment

While chemical shift-based approaches offer an opportunity to determine the structures of proteins de novo , the main driving forces of conventional structure determination protocols by NMR are NOE measurements. NOE connectivities form a network of inter-proton distance restraints (typically within 6 Å in the 3D structure) that can be used directly in structure determination. Typically, hundreds to thousands of NOE restraints are required to define backbone and sidechain orientations during the process of structure modeling [ 15 , 16 , 211 ]. However, acquiring such restraints is a labor-intensive activity that involves analyzing and interpreting hundreds of cross-peaks in the NOESY spectra. NOE cross-peaks can be assigned to atom pairs in the protein sequence through the accurate mapping of proton chemical shifts to the cross-peak coordinates. The problem of ambiguity arises rapidly during this mapping process, as a result of spectral overlap. Therefore, automatic NOESY assignment and structure refinement has been an iterative process, which typically relies on highly complete (>90%) and accurately assigned NMR chemical shifts. Here, an initial, self-consistent network [ 212 ] of relatively unambiguous NOE restraints (of the order of 100, depending on target size and degree of spectral overlap) is drawn from the more unique mappings, to generate an initial set of low-resolution structures [ 15 ]. The low-resolution structures from early stages are then used to reduce uncertainty in the remaining unassigned or ambiguously assigned NOE restraints [ 144 , 213 ]. This general concept laid the foundation for the majority of NMR structure determination programs [ 7 , 15 ]. In these approaches, backbone dihedral angle restraints derived from chemical shifts using TALOS and similar methods can play an auxiliary role in biasing the search towards more native-like conformations that successively help assign more long-range NOEs.
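The origin of assignment ambiguity can be seen in a small sketch of the mapping step. It assumes a simplified 2D 1 H– 1 H NOESY peak and a single matching tolerance for both dimensions; the atom labels, the tolerance and the function name are hypothetical, and real assignment programs work with higher-dimensional peaks and additional filters.

```python
from itertools import product

def candidate_assignments(peak, shifts, tol=0.02):
    """List the atom pairs whose assigned shifts match a cross-peak.

    peak   : (w1, w2) peak coordinates in ppm
    shifts : dict mapping atom labels to assigned 1H chemical shifts in ppm
    tol    : matching tolerance in ppm, applied to both dimensions
    """
    w1, w2 = peak
    dim1 = [atom for atom, s in shifts.items() if abs(s - w1) <= tol]
    dim2 = [atom for atom, s in shifts.items() if abs(s - w2) <= tol]
    # Every combination of matching atoms is a possible assignment of the peak.
    return [(a, b) for a, b in product(dim1, dim2) if a != b]
```

Two protons with overlapping shifts in each dimension already yield four candidate pairs for one peak, which is why an initial structure is needed to discard the candidates that are too far apart in space.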

Several excellent reviews discuss the internal workings of many successful NOE assignment and structure refinement approaches [ 7 , 15 ]; here, we focus on fragment-based approaches that offer an opportunity to further explore this concept through a more optimal use of chemical shifts as a means to improve sampling of native-like conformations. For instance, Auto-NOE-Rosetta (Automatic NOESY Assignment-Rosetta) [ 214 ] leverages the powerful RASREC-Rosetta sampling engine (see Section 4) together with an iterative NOE assignment algorithm which uses network anchoring [ 212 ], agreement with a pool of preliminary models (already built into the RASREC algorithm), and presence of symmetry-related cross-peaks. The assigned NOEs are used to derive distance restraints at various ambiguity levels [ 144 ]. Low-confidence, ambiguous restraints can be combined with highly unambiguous restraints [ 144 , 215 ] and used within eight distinct conformational sampling stages by Auto-NOE-Rosetta.
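One widely used way to turn an ambiguous cross-peak into a usable restraint is to combine all candidate proton pairs into a single effective distance by summing their r −6 contributions. The sketch below shows that convention; whether Auto-NOE-Rosetta uses exactly this form is an assumption here, and the function names are hypothetical.

```python
def effective_distance(distances):
    """Effective distance (Å) of an ambiguous restraint by r^-6 summation."""
    if not distances:
        raise ValueError("at least one candidate distance is required")
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)

def is_satisfied(distances, upper_bound):
    """True if the combined candidates explain the NOE within the upper bound."""
    return effective_distance(distances) <= upper_bound
```

Because the sum is dominated by the shortest distance, a single close pair among the candidates is enough to satisfy the restraint, so ambiguous restraints can be applied before the correct assignment is known.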

To demonstrate the improved performance of chemical shift fragment-based approaches in NOE-driven structure determination, we performed new structure calculations of the 198 aa α-lytic protease (aLP) protein from sequence information alone using Auto-NOE-Rosetta and compared them with PDB deposited models determined with the help of chemical shifts (PDB ID 5WOT) [ 216 ] ( Fig. 10A and B ). From this comparison, it can be observed that the models obtained using chemical shift fragments have lower energies, exhibit higher convergence and are closer to the X-ray structure (PDB ID 1P01) ( Fig. 10A–C ). Further analysis of NOE assignments in the models for both scenarios revealed, as expected, very low (~30%) recovery of native residue pair contacts for sequence based fragments ( Fig. 10D ) relative to the contacts obtained using chemical shift fragments (~70%). Moreover, Auto-NOE-Rosetta successfully assigned approximately three times more NOE restraints per residue in the structural ensemble calculated using chemical shift fragments as opposed to those obtained using only the sequence fragments ( Fig. 10E ). This result stems from the sampling of native-like structures during the early stages of the protocol, which in turn helps assign more long-range NOE restraints. Hence, our comparison illustrates a high degree of synergy between chemical shifts and NOE structural restraints in driving CS-Rosetta structure calculations.

Fig. 10.

Synergy between chemical shift-based fragments and automated NOE assignments. ( A ) Ten lowest-energy structures of the 198 aa α-lytic protease protein computed by AutoNOE-Rosetta from fragments derived using the protein amino acid sequence together with manually assigned NMR backbone chemical shifts (PDB ID 5WOT) [ 216 ]. ( B ) Ten lowest-energy structures of the same protein computed by AutoNOE-Rosetta using fragments derived solely from the amino acid sequence. ( C ) Energy (in R.E.U., Rosetta Energy Units), RMSD (in Å) to the X-ray structure (PDB ID 1P01) and convergence (in %) of the ten lowest-energy structures computed by AutoNOE-Rosetta with (purple) and without (red) using chemical shifts (gray arrows). Convergence of the structures is shown in the gradient scale to the right. ( D ) NOE contacts are defined as a function of residue pairs. The upper triangular region represents long-range (at least 5 residues apart) NOE contacts identified by AutoNOE-Rosetta for two independent calculations: first, performed by applying structural fragments derived from the protein amino acid sequence (red), and second, using chemical shift fragments (silver). The lower triangular region represents long-range NOE contacts predicted between all possible protons in the X-ray structure (PDB ID 1P01), using a 5.5 Å distance threshold (green). ( E ) Number of NOE restraints assigned by AutoNOE-Rosetta for each residue in the ten lowest-energy structural models computed with (silver) and without (red) using chemical shift-based fragments. Data obtained from [ 216 ].
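The contact recovery quoted above can be approximated with a short calculation. The sketch below is a simplification of the analysis in the caption: it uses one representative coordinate per residue instead of all protons, keeps the 5.5 Å threshold and the minimum separation of five residues, and the function names are hypothetical.

```python
import numpy as np

def long_range_contacts(coords, cutoff=5.5, min_separation=5):
    """Residue pairs (i, j) closer than the cutoff and far apart in sequence.

    coords: (N, 3) array with one representative position per residue (Å).
    """
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(i + min_separation, n)
            if dist[i, j] <= cutoff}

def contact_recovery(assigned, reference):
    """Fraction of reference contacts that are also present in the assigned set."""
    return len(assigned & reference) / len(reference) if reference else 0.0
```

Here `reference` would be computed from the X-ray coordinates and `assigned` from the residue pairs connected by assigned NOEs, so a value of 0.7 would correspond to the roughly 70% recovery reported for chemical shift fragments.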

11. New approaches to automated chemical shift assignments

The accuracy of all chemical shift-based structure modeling methods addressed so far depends, to a large extent, on both the correctness and completeness of the input chemical shift assignments. In recent years, there has been a surge of development in methods for automated chemical shift resonance assignments of both backbone and sidechain atoms, often with very high levels of accuracy (reviewed in [ 16 , 217 ]). Such algorithms have become integral components of structure calculation protocols employing NOEs and/or RDCs. While the majority of chemical shift assignment algorithms operate on the basis of a large number (6–10) of complementary NMR spectra, an effort towards reducing the number of input spectra is driven by the need to simplify and further automate the NMR structure determination process.

With this in mind, 4D-CHAINS, an automated procedure, was developed recently to assign backbone and sidechain chemical shifts using two complementary 4D (TOCSY and NOESY) spectra, recorded in fully protonated samples [ 216 ]. 4D-CHAINS uses 2D probability density maps of correlated 13 C– 1 H chemical shifts to identify spin systems (termed amino acid index groups or AAIGs) in the input 4D data. Thereafter, AAIGs are mapped to amino acids in the query protein sequence using a procedure similar to genome assembly used in DNA sequencing [ 218 ]. During this process, contiguous segments of AAIGs are iteratively matched along the protein sequence until a self-consistent assignment solution is obtained. The high levels of accuracy and completeness of 4D-CHAINS ( Fig. 11A ) allow it to be combined with NOE assignment and structure determination algorithms, such as AutoNOE-Rosetta. The practical utility of the combined 4D-CHAINS/AutoNOE-Rosetta protocol was demonstrated recently through the structure calculation of aLP ( Fig. 11B ) [ 216 ]. Here, 4D-CHAINS assigned chemical shifts together with two unassigned NOESY peak lists were provided as input to AutoNOE-Rosetta. The combined protocol (i) generated structural models within 1.3 Å from the reference X-ray structure (PDB ID 1P01) ( Fig. 11B ) and (ii) captured two-thirds of the crystallographic NOE contacts across the entire protein, suggesting good recovery of near-native folds ( Fig. 11B–D ) [ 216 ]. Together, the 4D-CHAINS/AutoNOE-Rosetta approach forms a complete, automated pipeline for NMR structure determination from a minimal set of spectra.

Fig. 11.

Automated chemical shift assignment and structure determination of α-lytic protease using 4D-CHAINS/AutoNOE-Rosetta. The use of the 4D-CHAINS/AutoNOE-Rosetta pipeline [ 216 ] is illustrated for a 20 kDa, uniformly 13 C, 15 N-labeled protein with a highly complex β-fold topology. ( A ) 4D-CHAINS produces reliable assignments at completeness levels (~93%) which exceed the minimum required (~70%) by AutoNOE-Rosetta to converge on the correct fold using simulated peak lists. First, 4D-CHAINS assigns 77% of all observed backbone and sidechain chemical shifts using a 4D HC(CC–TOCSY(CO))NH experiment (dark green). Second, correct assignments are automatically extended by an additional 13% using common NOEs in a 4D 13 C, 15 N-edited HMQC-NOESY-HSQC (Heteronuclear Single Quantum Coherence) experiment (light green). The full method has a combined 1.9% error rate (red), and does not consider the resonances of aromatic or sidechain amide groups (silver), which can be readily obtained manually using the automated assignments as a guide. ( B ) Ensemble of ten lowest-energy structures calculated using AutoNOE-Rosetta, superimposed on the X-ray reference structure (PDB ID 1P01). Average RMSD to X-ray: 1.3 Å (computed for backbone atoms over core secondary structure regions). ( C ) NOE contacts defined for residue pairs along the sequence of α-lytic protease. The upper triangular region represents NOE contacts identified by AutoNOE-Rosetta using chemical shifts assigned by 4D-CHAINS and two complementary 4D NOESY peak lists (HCNH and HCCH). The lower triangular region represents all degenerate 1 H NOE contacts predicted from the X-ray structure using a 5.5 Å distance threshold. ( D ) Comparison of the total number of NOE contacts between amide-amide, amide-aliphatic and aliphatic-aliphatic protons assigned by AutoNOE-Rosetta and NOEs predicted from the X-ray structure as described in ( C ). All structure diagrams were prepared using PYMOL ( https://pymol.org/2/ ).

12. Integration of other NMR structural parameters in chemical shift-based methods

Classical NOE-based approaches for NMR structure determination rely on the analysis of short- to medium-range (<6 Å) 1 H– 1 H distance restraints [ 219 ]. These local NOE connectivities are typically complemented with more global restraints obtained from measurements of RDCs, PREs and PCSs [ 220 , 221 ] during the final stages of structure refinement. More recently, the use of such “global” restraints, in conjunction with chemical shift fragment-based approaches, has proven to be a powerful combination to alleviate or reduce the requirement of NOEs for modeling protein structures. Here, we briefly discuss the utility of RDC-, PRE-, and PCS-derived restraints in chemical shift-based structure determination.

RDC measurements report on global orientations between inter-nuclear bond vectors with respect to an overall alignment frame [ 31 , 220 , 222 ]. RDCs are highly sensitive structural parameters; therefore, their application during structure refinement and validation can help not only to identify the overall protein fold, but also to pinpoint detailed structural features, such as the precise equilibrium length of bonds [ 223 , 224 ] or deviations from planarity in the peptide group [ 225 ]. However, these high resolution applications of RDCs are limited to smaller proteins. Normally, RDC restraints have been employed within de novo structure determination protocols of various levels of complexity [ 226 ]. Due to the degeneracy of RDC values with respect to the underlying orientation of inter-nuclear vectors, multiple independent datasets recorded using different alignment media are required in order to define a uniquely preferred orientation [ 226 ]. More recently, significant progress has been made in the development of automated structure determination approaches guided by chemical shifts and/or sparse RDC restraints [ 7 , 226 ]. In all these methods, RDCs offer a highly complementary source of structural information to the backbone chemical shifts; while chemical shifts are very sensitive to the local backbone structure, RDCs help define long-range structural features, particularly the orientation of different secondary structural elements and individual domains within the structures of multi-domain proteins. This was recently demonstrated through RASREC-Rosetta calculations, where the use of amide RDCs in conjunction with backbone chemical shifts and sparse amide NOEs enabled structure determination of targets up to 25 kDa [ 81 ]. Finally, RDCs together with chemical shifts offer an opportunity for self-consistent cross-validation of NMR structures [ 227 ], which becomes particularly relevant in the face of sparse datasets.
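The orientational degeneracy mentioned above follows directly from the functional form of the RDC. The sketch below uses the common expression in the principal axis frame of the alignment tensor, with an axial component and a rhombicity; normalization conventions for these parameters vary between programs, so the parameterization and the function name should be read as assumptions.

```python
import numpy as np

def rdc(theta, phi, d_a, rhombicity):
    """RDC for a bond vector with polar angles (theta, phi), in radians,
    in the principal axis frame of the alignment tensor.

    d_a        : axial component of the alignment tensor (Hz)
    rhombicity : rhombic-to-axial ratio (dimensionless)
    """
    axial = 3.0 * np.cos(theta) ** 2 - 1.0
    rhombic = 1.5 * rhombicity * np.sin(theta) ** 2 * np.cos(2.0 * phi)
    return d_a * (axial + rhombic)
```

Because only cos²θ, sin²θ and cos 2φ enter, the orientations (θ, φ), (π − θ, φ), (θ, −φ) and (θ, φ + π) all give the same coupling, and a continuous set of other orientations does as well. A second alignment medium with a differently oriented tensor removes most of this ambiguity.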

PRE restraints are obtained through a quantitative analysis of 15 N and 13 C relaxation rates in samples containing paramagnetic tags, typically attached via site-specific labeling approaches, relative to a diamagnetic reference sample. Here, the conjugation of nitroxide spin labels to engineered disulfides in proteins has been particularly useful. Alternatively, solution PREs have been widely adopted for structure modeling applications allowing for de novo structure determination of large proteins (40–100 kDa) in the absence of abundant long-range NOE restraints [ 221 , 228 , 229 ]. For instance, solvent (s)PRE-CS-Rosetta [ 229 ] makes use of the global fold information encoded within sPRE restraints (i.e. distance measurements between a paramagnetic solute and the protein surface) and chemical shifts to model protein structures. In the sPRE-CS-Rosetta protocol, amino acid sequences together with NMR chemical shifts are used to generate backbone fragments, which are subsequently assembled to produce low-resolution structural models (see Section 4). These low-resolution models are further used to back-calculate the sPRE effect for comparison against experimental data, and additionally to compute the sPRE-based score which is used to adjust the energy function. This approach leverages a fast, grid-based method for sPRE computation during the low-resolution stage of ab initio fragment assembly. Thus, the use of sPRE restraints complements chemical shift-based fragments by biasing the collapse of the polypeptide chain towards more native-like conformations [ 229 ].

PCS measurements provide structural restraints derived by measuring changes in chemical shift values due to the presence of a paramagnetic metal ion [ 230 ]. In contrast to NOEs and PREs, which show an r −6 dependence, PCSs display an r −3 distance dependence, allowing for comparatively longer distance measurements between atoms (up to 40 Å) [ 221 ]; their orientational dependence provides a further powerful source of structural discrimination. Therefore, PCSs are used by protein structure determination algorithms to obtain global fold information [ 231 – 234 ]. While PCSs are extensively utilized during docking or structure refinement, their use is limited during de novo modeling because the tensor parameters used to calculate PCS distance restraints depend on atomic coordinates. PCS-Rosetta extends Rosetta’s ab initio algorithm and makes use of chemical shift-derived fragments together with a low-resolution energy function adjusted according to the PCS score (computed using experimental PCS data) [ 232 ]. The PCS-based score term is obtained by interleaving a grid search, which defines the position of the paramagnetic tag, with a singular value decomposition to fit the five tensor parameters. Following the low-resolution stage, sidechains are introduced and refined using a full-atom energy function augmented by the PCS score. This protocol has been recently expanded to include paramagnetic tags located at multiple sites with the aim of enabling more robust structure determination of smaller proteins [ 233 ].
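The interleaved grid search and tensor fit can be sketched as follows. For a fixed metal position, the PCS is linear in the five independent elements of a traceless, symmetric Δχ tensor, so these can be obtained by linear least squares (NumPy's `lstsq` is SVD-based). This is an illustrative sketch rather than the PCS-Rosetta code: the function names are hypothetical and the units of the tensor elements are left to the caller.

```python
import numpy as np

def pcs_design_matrix(coords, metal):
    """Rows map the tensor elements (xx, yy, xy, xz, yz) to the PCS of each atom."""
    r = np.asarray(coords, dtype=float) - np.asarray(metal, dtype=float)
    x, y, z = r[:, 0], r[:, 1], r[:, 2]
    r5 = np.linalg.norm(r, axis=1) ** 5
    # The zz element is eliminated using the traceless condition zz = -(xx + yy).
    cols = np.stack([x * x - z * z, y * y - z * z,
                     2 * x * y, 2 * x * z, 2 * y * z], axis=1)
    return cols / (4.0 * np.pi * r5[:, None])

def fit_tensor(coords, pcs, metal):
    """Least-squares tensor elements and squared residual for one metal position."""
    pcs = np.asarray(pcs, dtype=float)
    a = pcs_design_matrix(coords, metal)
    params, *_ = np.linalg.lstsq(a, pcs, rcond=None)
    residual = float(np.sum((a @ params - pcs) ** 2))
    return params, residual

def grid_search(coords, pcs, candidate_metals):
    """Return (metal, params, residual) for the best-fitting candidate position."""
    results = [(m, *fit_tensor(coords, pcs, m)) for m in candidate_metals]
    return min(results, key=lambda item: item[2])
```

The residual of the best grid point plays the role of a PCS score for a given structural model: models whose coordinates cannot be reconciled with any tensor and metal position are penalized.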

As highlighted by these efforts, the combination of RDCs, PREs and PCSs with local structural restraints obtained from NMR chemical shifts provides a powerful approach towards modeling protein structures with high accuracy, using very sparse or, in some cases, no NOE data. This can be a valuable tool for the study of membrane proteins [ 235 , 236 ], and proteins in the solid-state [ 237 ].

13. Accessibility and performance of chemical shift-based structure determination methods

All chemical shift-based structure determination methods discussed in this review are available publicly via web servers or downloadable software packages ( Table 2 ). Detailed manuals are available for most methods, rendering them easy to use by users with a minimal background in UNIX operating systems. Most de novo prediction methods employ fragment assembly to model monomeric or oligomeric protein structures, with various computational requirements owing to the complexity and parallelization of the corresponding protocols. While the fragment selection step itself has very modest computational cost (30–40 min on a commodity machine, depending on the target size), and can be run in parallel using the MPI build of Rosetta, ab initio structure refinement is a more demanding task.

Table 2. Availability of a subset of the chemical shift-based methods used for structure determination discussed in this review.

| Method | Availability (website) | Web server support | Platforms supported |
|---|---|---|---|
| CHESHIRE | Available from authors | | |
| CS-Rosetta | | | UNIX-based systems |
| RASREC-Rosetta | | | UNIX-based systems |
| CS-MD | Available from authors | | |
| CS-HM-Rosetta | | | UNIX-based systems |
| Pomona | | | UNIX-based systems |
| RASREC-Rosetta with EC restraints | | | UNIX-based systems |
| EC-NMR | | | UNIX-based systems (and Windows for several steps of the protocol) |
| RosettaOligomers | | | UNIX-based systems |
| CamDock | Available from authors | | |
| CS-Rosetta-RNA | | | UNIX-based systems |
| AutoNOE-Rosetta | | | UNIX-based systems |
| sPRE-CS-Rosetta | | | UNIX-based systems |
| PCS-Rosetta | | | UNIX-based systems |
| GPS-Rosetta | | | UNIX-based systems |
| 4D-CHAINS | | | UNIX-based systems and Windows |

As a representative example, we compared total runtimes as a function of the number of processors for the 20 kDa aLP target using three main approaches: CS-Rosetta, RASREC-Rosetta and AutoNOE-Rosetta. As input, these protocols were given the amino acid sequence along with three- and nine-residue fragments derived using NMR chemical shifts [ 216 ]. In addition, two unassigned NOESY (4D HCCH and 4D HCNH) peak lists were provided to AutoNOE-Rosetta to perform NOE assignment alongside ab initio structure determination. CS-Rosetta and RASREC-Rosetta calculations were performed independently using manually assigned NOE constraints. We sampled a total of 10,000 independent CS-Rosetta structures, whereas RASREC-Rosetta and AutoNOE-Rosetta generated 50–80 batches of 100 structures (depending on the progression of each protocol). The fragment-based approaches generally require 16 or more CPUs (of a commodity computer or a UNIX-based cluster) to yield structures in a reasonable amount of time ( Fig. 12 ). Due to sampling bottlenecks, it is recommended to run such calculations on 64 or more CPUs for larger (>200 aa) targets.


Fig. 12. Performance of chemical shift-based fragment assembly methods. Performance of CS-Rosetta (green), RASREC-Rosetta (orange) and AutoNOE-Rosetta (blue) for a 20 kDa target, aLP, given by their runtimes measured as a function of the number of processors (or CPUs) used for structure calculation. All runs were carried out using sequence information, chemical shifts assigned by 4D-CHAINS, NOE restraints (for CS- and RASREC-Rosetta) and unassigned peak lists (for AutoNOE-Rosetta) as input. The points on the plot represent independent structure calculations performed by the respective methods using various numbers of processors (16, 32, 64, 100 and 200). The y-axis shows the time (in hours) taken by each method for each calculation, bounded by the number of hours considered reasonable (~250 h or 10 days). For CS-Rosetta calculations, 10,000 structures are produced during each run. Similarly, for RASREC-Rosetta and AutoNOE-Rosetta calculations, 50–80 batches of 100 structures are produced for every execution.

By comparison, homology-based approaches carry out a preprocessing step to derive restraints or to identify templates from a set of evolutionarily related proteins. Representative times needed to generate restraints or find templates for aLP using CS-HM-Rosetta, Pomona, and EVFold (for RASREC-Rosetta and EC-NMR) are approximately 4, 400 and 300 min, respectively, on a single CPU.
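For easier comparison with the hour-scale runtimes reported for the fragment assembly protocols, the preprocessing times quoted above can be converted to hours. The short Python sketch below does only this unit conversion; the dictionary and function names are hypothetical and introduced here purely for illustration.

```python
# Approximate single-CPU preprocessing times for aLP, in minutes (from the text).
PREPROCESSING_MINUTES = {
    "CS-HM-Rosetta": 4,
    "Pomona": 400,
    "EVFold": 300,
}


def minutes_to_hours(minutes: float) -> float:
    """Convert a duration in minutes to hours."""
    return minutes / 60


# Preprocessing time per method in hours, rounded to two decimals.
preprocessing_hours = {
    method: round(minutes_to_hours(minutes), 2)
    for method, minutes in PREPROCESSING_MINUTES.items()
}

if __name__ == "__main__":
    for method, hours in preprocessing_hours.items():
        print(f"{method}: about {hours} h on a single CPU")
```

Expressed this way, the slowest preprocessing step (Pomona, under 7 h) remains small relative to the ~250 h ceiling used for the structure calculations in Fig. 12.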

14. Conclusions and future outlook

Ever since the first de novo atomic-resolution structure was determined by NMR in 1985 [ 238 ], chemical shifts have remained an invaluable tool for spectroscopists examining the structure and dynamics of biomolecules, for systems up to 1 MDa (megadalton), in solution [ 174 ] and in the solid state [ 239 ]. More recently, NMR methods have been applied to determine protein structures within living cells [ 240 ]. In this review, we have outlined a representative subset of complementary approaches for chemical shift-driven structure determination. The active development of new algorithms and the expansion of curated databases have the potential to further improve the robustness and accuracy of chemical shift-based methods, to complement and possibly replace classical methods of NMR structure calculation. While the sensitivity and versatility of chemical shifts for structure determination are highlighted by the sheer number and applicability of available approaches, the information provided by chemical shifts alone is largely limited to a description of local geometry [ 4 ]. Thus, hybrid approaches that combine chemical shifts with additional short-range and long-range restraints, such as NOE [ 219 ], RDC [ 220 ], PRE [ 221 ], and PCS [ 221 ] measurements, are expected to further increase the scope, accuracy and resolution of NMR-derived structures. Integrated approaches that combine NMR chemical shifts with other types of experimental data, such as SAXS [ 151 ], Cryo-EM (cryo-electron microscopy) [ 241 ], SANS (small-angle neutron scattering) [ 242 ], and EPR (electron paramagnetic resonance) [ 243 ], will provide additional avenues for structure determination of larger and more challenging systems.

At the same time, automated methods have streamlined the chemical shift assignment procedure, allowing for structure determination of small to moderately sized proteins (up to 200 residues, ~22 kDa) with minimal intervention by the user [ 7 , 217 ]. Progress in automated NMR structure determination will enable a more thorough description of the protein fold space, allowing for more accurate homology modeling and fragment generation. For systems of larger size and dynamic complexity, automated methods benefit from advances in selective isotope labeling schemes [ 244 ], the use of probes with favorable relaxation properties, such as methyl groups [ 245 ], and the utilization of sparse restraints [ 94 , 104 , 128 ]. Here, highly parallel, iterative protocols, such as RASREC-Rosetta, can lead to a drastic improvement in sampling efficiency and accurately determine near-native structures from sparser datasets. Knowing that sequence covariance can provide sufficient long-range information to model the folds of medium-sized proteins, several research groups have moved on to incorporate evolutionary information, which drastically reduces the computational costs required by more data-oriented approaches. With the advent of sparse data recorded for larger systems, the mining of evolutionary information from genome sequencing, and the fine-tuning of sidechain conformations according to chemical shift data, the next generation of methods will aim to deliver a more accurate view of biomolecular structures and their dynamics, towards a new renaissance in structure determination by NMR methods.

Acknowledgements

The authors would like to thank Oliver Lange, Robert Vernon, Yang Shen, Jinfa Ying, Paolo Rossi, Flemming Hansen, Kostas Tripsianes, David Baker and Ad Bax for helpful discussions over the years. This manuscript was supported in part by funds from the Intramural Research Program of the NIAID, NIH, and by a K-22 Career Development Award and an R35 Outstanding Investigator Award to N.G.S. through NIAID (AI112573) and NIGMS (R35GM125034), respectively. Research reported in this publication was supported by the Office Of The Director, NIH, under Award Number S10OD018455.

Glossary of abbreviations

Å: Angstrom
µs: microsecond
aa: amino acid
aLP: α-lytic protease
AAIG: amino acid index group
ARIA: ambiguous restraints for iterative assignment
AutoNOE-Rosetta: automatic NOESY assignment-Rosetta
BLOSUM: block substitution matrix
BMRB: biological magnetic resonance bank
CASD-NMR: community wide assessment of NMR structure determination
CASP: critical assessment of methods of protein structure prediction
CEST: chemical-exchange saturation transfer
CHEOPS: chemical shift de novo structure derivation protocol employing singular value decomposition
CHESHIRE: chemical shift restraints
CMV: cytomegalovirus
COSY: correlation spectroscopy
CPMG: Carr-Purcell-Meiboom-Gill
Cryo-EM: cryo-electron microscopy
CS-HM-Rosetta: chemical shift-homology modeling-Rosetta
CS-Rosetta-RNA: chemical shift-Rosetta-ribonucleic acid
CS-Rosetta: chemical shift-Rosetta
CS-RosettaCM: chemical shift-Rosetta comparative modeling
CSA: chemical shift anisotropy
CS-MD: chemical shift restrained molecular dynamics
CYANA: combined assignment and dynamics algorithm for NMR applications
DFT: density functional theory
DI: direct information
DSSP: database of secondary structure assignments
EC: evolutionary coupling
EC-NMR: evolutionary coupling-nuclear magnetic resonance spectroscopy
EPR: electron paramagnetic resonance
EVFold: evolutionary fold
FARFAR: fragment assembly of RNA with full atom refinement
HADDOCK: high ambiguity driven docking
HHSearch: HMM-HMM search
HIV: human immunodeficiency virus
HMM: hidden Markov model
HMQC: heteronuclear multiple quantum coherence
HSQC: heteronuclear single quantum coherence
I-TASSER: iterative threading assembly refinement
Ig: immunoglobulin
ILV: isoleucine leucine valine
kDa: kilodalton
MDa: megadalton
MFR: molecular fragment replacement
MPI: message passing interface
ms: millisecond
MSA: multiple sequence alignment
NESG: northeast structural genomics
NMR: nuclear magnetic resonance
NOE: nuclear Overhauser effect
NOESY: nuclear Overhauser effect spectroscopy
NUS: non-uniform sampling
PCS: pseudocontact shifts
PDB: protein data bank
PISCES: public server for culling sets of protein sequences from the PDB by sequence identity
PLM: pseudo likelihood maximization
Pomona: protein alignments obtained by matching of NMR assignments
PRE: paramagnetic relaxation enhancement
PROMEGA: proline omega angle prediction
RASREC-Rosetta: resolution adapted structural recombination-Rosetta
RDC: residual dipolar coupling
R.E.U.: Rosetta energy units
RM: regulatory module
RMSD: root mean square deviation
RNA: ribonucleic acid
RosettaCM: Rosetta comparative modeling
RTT: regulator of Ty1 transposition
SANS: small-angle neutron scattering
SAXS: small-angle X-ray scattering
SCOP: structural classification of proteins
SH: Src homology
SPARTA: shifts predicted from analogy in residue type and torsion angle
sPRE: solvent paramagnetic relaxation enhancement
SrtA: sortase A
STRIDE: structural identification
SWA: stepwise assembly
TALOS: torsion angle likelihood obtained from shift and sequence similarity
TOCSY: total correlation spectroscopy

Conflict of interest

The authors declare that they have no conflict of interest.
