  #1
    Ogre in the Playground
     
    NecromancerGuy

    Join Date
    Mar 2010

    Default Half an explanation of protein folding

    Someone on the folding@home thread asked what’s going on in the visualization, and my answer kind of grew as I added more explanations and examples until it became kind of a general overview of protein folding. I make no claims about its particular applicability to folding@home, not being associated with the lab behind it, but it’s as accurate as I can get it while still being comprehensible and mostly math-free. I hope this is thorough enough, and maybe other people were curious as well.

    Why we fold proteins:

    It is traditional to start describing a scientific thing by explaining why anyone should care about it, so I'll start by listing reasons we want to know how proteins fold: Sickle cell. Alzheimer's. Kuru. Parkinson's. Huntington's. Glaucoma. ALS. Type II diabetes. Cataracts. Every single one of these is directly traceable to a protein folding incorrectly. Indirectly, every virus anyone has ever had is nothing more than a precisely folded protein coat containing its genetic material, and every antibody that confers resistance to one has interacted with those proteins in a highly specific way. Even more generally, protein folding is how the vast majority of missense mutations affect people -- and it's how the beneficial mutations and useful metabolic pathways we build into organisms for synthetic biological purposes go from genetic material to functioning chemical changes. Proteins matter, and it's not much of an exaggeration to say that a fast, general solution to go from a one-dimensional protein sequence to a three-dimensional protein structure would be one of the most significant scientific developments in human history to date. It would let us design enzymes to make nearly any compound we could want (including biofuels, medically useful small molecules, etc.) and hand us the tools to understand the biochemistry behind every disease we know -- and the means to design biosensors to diagnose them and therapeutics to treat them.

    If none of that sounds worthwhile, then I guess protein folding won't either, but the above is part of why we spend so much time building protein folding programs and inviting the public to donate CPU time to the protein folding problem and/or to try folding proteins themselves.


    Why proteins need to be folded:

    You might be wondering why the protein folding problem is such a big deal while the DNA folding problem is not as commonly discussed. The reason has nothing to do with the relative importance of the molecules or anything like that; the structure of DNA is a hugely important determinant of how genes are regulated. Rather, it's that "the structure of DNA" is a meaningful concept in itself: DNA almost always adopts one of three kinds of double-helix structure regardless of sequence, so we don't necessarily need to model the structure of any given sequence. Similarly, many other biological molecules like carbohydrates and lipids are small, so it's not that hard to just go through all the relevant conformations they can adopt one by one.

    RNA structure prediction is still really hard, but it should be pointed out that, like DNA, RNA contains only four kinds of nucleobase; it has the flexibility to fold into really complicated shapes, but they're still constrained primarily by base pairing. Folding RNA is as much a topological problem as a chemical one, but there are somewhat fewer rules to remember.

    Proteins, though, can vary wildly in the chemical properties of their subunits. When they're made, they're an effectively one-dimensional sequence of (generally) 19 amino acids and one imino acid; this is termed the protein's primary sequence. Among their number are acids, bases, and structures both large and small that interact with water and each other in countless ways -- all while attached to a backbone that is itself composed of positive and negative charges and weird one-and-a-half bonds that we'll get into later. In order to properly model them, we need a working understanding of quantum electrodynamics or at minimum molecular geometry. Then we need to discard most of that understanding because it's too math-intensive. We can comfortably model dozens or perhaps a few hundred atoms of condensed matter properly with the supercomputers currently available to us; even a small protein is far bigger than that.

    So, in short, protein folding is a subset of protein structure prediction, the science of using supercomputers to cheat at quantum physics for biomedical purposes.


    How proteins can be folded:

    There are a couple of ways to go about doing this, but for our purposes we need to consider only two: molecular dynamics and the constellation of scoring-based Markov and Monte Carlo methods. Explaining either of them requires that we start by looking at the geometric constraints on molecules -- and rather than just tell you about those, I'd like to show you.

    For that, we need PyMOL; it's free, runs on just about anything, and will show you proteins. Instructions follow:

    Once you have it installed, open it up and find the little white text box that serves as the console. Then type the following:

    fetch 1a2x

    You should see a line representation of 1A2X.pdb, which is a model of troponin C and which will look rather confusing. Let's make it prettier. Run

    show cartoon
    util.chainbow("1a2x")
    util.cnc 1a2x
    show sticks

    and click the A next to "all", then select hydrogens->add.
    Also click the S in the lower right corner to show the sequence.


    It is, at the very least, more colorful now. What we have done is to color the carbon atoms according to where they fall in the protein's primary sequence while coloring everything else by element, show the secondary structure elements in cartoon mode, and then thicken the lines into sticks so it's easier to see the atoms. Now feel free to click and drag to spin the molecule around; you should see some helices and a very small blue-green pair of arrows. These are secondary structure elements, which we will get into later.

    For now, just look at how complicated and irregular this fold is; unlike DNA, a single protein molecule can have its amino acids adopt many orientations relative to each other. This protein also has multiple chains, again unlike DNA, which you can see now that we've colored the N termini of all the chains blue. This further complicates the folding problem, as you might imagine; while the subunits of this particular protein bind and unbind in the cell, many others are mostly-permanently locked into a given complex and need each other to fold.

    If we had to guess every position of every atom in a given protein, we'd be here forever; this is related to the Levinthal paradox, in which a randomly folding macromolecule the size of a protein cannot be guaranteed to reach a particular conformation within the lifetime of the universe, so proteins are obviously not folding randomly. Even so, the number of possible conformations is huge -- but we have some help narrowing them down now.

    Let's zoom in on a single amino acid: click on the D in position 2 of the A chain of 1A2X. Now, under the "sele" object, click to zoom, center, and orient, then color by element and color the carbons green. The atoms colored with pink squares are a single amino acid, the fundamental subunit of proteins. You can see that there is a carbon in the middle, called the alpha carbon, bonded to which are a single white hydrogen, a blue nitrogen, and a carbonyl carbon bonded to a red oxygen. All of these are backbone features, and will be constant from alpha carbon to alpha carbon. The fourth thing is the side chain, here a CH2COO-. This varies from amino acid to amino acid.

    Use the measurement wizard (wizard->measurement) to measure some distances and angles by clicking on pairs of atoms and reading the distance between them; you can switch to angle mode by clicking the top button below "measurement."

    Two things will become apparent as you explore polypeptide interatomic distances and angles:

    1. The distances between atoms bonded together are startlingly regular.
    2. The distances between more distantly connected atoms are anything but.

    The same is true of angles. This is because these connections are governed by the physics of covalent bonds, which dictate that there is a relatively narrow range of energetically optimal distances between two bonded atoms. (This does vary by bond type and the elements involved, but it holds true if those are constant.) You can look up the Lennard-Jones potential if you like, but in brief it dictates that the strength of a bond, described as the energy required to break it (or, equivalently, the depth of the bond's potential well), is greatest at a certain optimal distance -- the minimum of that potential well. Atoms any closer together are forced apart by Pauli repulsion as the electrons repel each other; atoms farther apart are drawn together to form more energetically optimal valence shell geometry. Likewise, the angle between two bonds is generally controlled by their relative strength and the hybridization of the molecule at the vertex; carbon-carbon and carbon-hydrogen bonds are almost always at the tetrahedral 109.5 degrees typical of sp3 bonds, while nitrogen, being sp2 hybridized, keeps its bonds a planar 120 degrees apart.
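    To make that concrete, here's a minimal Python sketch of the 12-6 Lennard-Jones form, with made-up epsilon and sigma values purely for illustration:

    def lennard_jones(r, epsilon=0.2, sigma=3.4):
        # Pair energy in arbitrary units: epsilon is the well depth,
        # sigma is the distance at which the energy crosses zero.
        sr6 = (sigma / r) ** 6
        return 4.0 * epsilon * (sr6 ** 2 - sr6)

    # The energy is lowest (the contact is most favorable) at r = 2^(1/6) * sigma;
    # closer than that, the repulsive r^-12 term dominates; farther out, the attraction fades.
    r_min = 2 ** (1.0 / 6.0) * 3.4
    print(lennard_jones(r_min))   # prints -epsilon, the bottom of the potential well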

    So if distances and angles don't really vary, something else must be giving rise to the incredible variation in protein structure: the dihedrals, or torsion angles around bonds. Four atoms define a dihedral: it is the angle between the bonds to the first and fourth atoms as seen when you sight down the bond between the second and third, superimposing them. Three such angles are important to us.

    1. The phi angle, for which atom 1 is the carbonyl carbon of the previous residue, atom 2 is the backbone nitrogen of the current residue, atom 3 is the alpha carbon, and atom 4 is the carbonyl carbon.
    2. The psi angle, which uses the same atoms as the phi angle except the carbonyl carbon of the previous residue is replaced by the nitrogen of the next residue.
    3. The chi angle, which is actually a set of angles defining the dihedrals along each side chain, working outward from the alpha carbon.

    There is a fourth dihedral, the omega angle, but it is rarely considered because it does not vary; the peptide bond draws some of the carbonyl group's double-bond character across to the nitrogen, and it's double-bond enough to avoid rotating except under extreme, non-physiologically-relevant levels of strain. Proteins with variable omega angles are probably on fire, frankly.
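    Since the dihedrals do all the work here, it's worth seeing how one is computed from coordinates. This is just the standard textbook formula; feed it the four backbone atoms listed above, in order, to get phi or psi:

    import numpy as np

    def dihedral(p1, p2, p3, p4):
        # p1..p4: numpy xyz coordinates, e.g. C(i-1), N(i), CA(i), C(i) for phi.
        b1, b2, b3 = p2 - p1, p3 - p2, p4 - p3
        n1 = np.cross(b1, b2)                      # normal to the plane of atoms 1-2-3
        n2 = np.cross(b2, b3)                      # normal to the plane of atoms 2-3-4
        m1 = np.cross(n1, b2 / np.linalg.norm(b2))
        return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

    You can sanity-check it against PyMOL's get_dihedral command on any four atoms you pick.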

    So we can reduce the problem of folding a protein to finding the optimal values for its phi, psi, and chi dihedrals. This is still a huge number of variables, though; a protein that is X amino acids long has 2X-2 backbone dihedral angles, plus at least X chi angles for all its non-glycine, non-alanine amino acids (prolines don't have one either, but they're their own kind of weird.) When you consider that many proteins are over a hundred amino acids long, it becomes clear that exhaustively searching every combination of angles is not feasible; this has been called Levinthal's Paradox. Briefly, a 100-amino-acid protein sampling one conformation per nanosecond would take far longer than the known age of the Universe to consistently select the single optimal set of dihedrals at random.
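    The arithmetic behind that claim is easy to reproduce, assuming (purely for illustration) about three coarse states per backbone dihedral:

    conformations = 3 ** 198            # 2*100 - 2 backbone dihedrals, ~3 states each: ~3e94
    seconds = conformations * 1e-9      # one conformation sampled per nanosecond
    universe_age = 4.3e17               # age of the Universe in seconds, roughly
    print(seconds / universe_age)       # ~7e67 lifetimes of the Universe

    The exact numbers don't matter; the exponential blowup does.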

    All of this sampling is pointless without some way of identifying the optimal configuration when we see it, though. That is the job of scoring, which is where we get into the cheating at quantum physics I mentioned earlier: typically, a given protein conformation's score is the sum of a bunch of different score terms, each multiplied by a scalar "weight" usually determined by some neural net or benchmarking protocol. Those terms are meant to approximate molecular quantum mechanics into something a computer can rapidly work through for very large molecules, mostly by breaking them down into things that can be summed across finite-sized sets of atoms and then constraining the number of those sets to exclude computationally expensive but physically unimportant interactions.
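    In code, the skeleton of such a score function is nothing more exotic than a weighted sum; the term names and weights below are placeholders rather than any real force field:

    def total_score(conformation, terms, weights):
        # Each term maps a conformation to a number; lower is better by convention here.
        return sum(w * term(conformation) for term, w in zip(terms, weights))

    # e.g. terms = [steric_term, electrostatics_term, hbond_term], weights = [4.0, 0.7, 1.2]
    # (hypothetical names and values -- the real work is writing the terms and fitting the weights)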

    First, we have to make sure atoms' electron clouds stay the right distance apart from each other; these are usually called something like steric terms or Lennard-Jones terms. I say "first" not because the order of these computations matters but because the weight of these terms is normally huge, as you might expect given the magnitude of Coulomb's constant.

    The next most frequently upweighted set of terms has to do with electrostatics, with one term representing the van der Waals force and another (or, more often, a set of them) attempting to account for hydrogen bonds. Briefly, atoms with different electronegativities will change the distribution of their electrons asymmetrically when bonded together, resulting in partial positive and negative charges on those atoms. Additionally, atoms with lone pairs (oxygen and nitrogen, usually) also have electrons distributed anisotropically around them. If you have ever heard "like dissolves like" or tried to figure out why oil and water "repel" each other, this is why: because it is energetically favorable for water to form as many hydrogen bonds as possible, and it cannot do that with oil, which is mostly made of hydrocarbons that aren't very polar at all.

    So we have one term to account for charge interactions generally and encourage all the charges to be on the surface of the protein while another term tries to prioritize making those hydrogen bonds, because despite the common physical underpinnings they need to be coded and thus weighted differently. We can look at them in PyMOL, too:

    Scoring is somewhat beyond what PyMOL can do out of the box, but it can give you a sense of the last two. If you've got PyMOL, fetch something with a lot of helices (the squiggly lines) and some beta sheets (the arrows); 1P1R will do nicely. Select and delete three of the chains, show the cartoon as above, add the hydrogens (A->hydrogens->add) and then find backbone-backbone polar contacts (A->find->polar contacts->just intra-main-chain) and look at all the dashed yellow lines. You might notice that, for example, there are bonds running up and down the helices, as well as bonds between the strands; those bonding structures actually define alpha helices and beta strands, which are two of the three main components of protein secondary structure. The other is random coil or loop, which isn't necessarily very structured at all.

    There are two things to note here. One, alpha helices are defined by residue i bonding to residue i+4, while beta sheets have a 1-to-1 bond structure that's hard to define in similar terms because the strands vary in length. We can, however, say that parallel beta sheets (all the arrows going the same way) are bonded such that if i is bonded to j, i+1 is bonded to j+1; the inverse is true of antiparallel beta sheets. 1P1R has both. Two, PyMOL can be over-zealous in finding hydrogen bonds, like with all the ones across the alpha helices. This is because it is not trivial to identify which hydrogen bonds involve atoms already "taken"; only one donor may bond to one acceptor and vice versa, but trying to enforce this leads to ambiguous situations.
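    That pairing rule is simple enough to write down directly; given the residue-index pairs (i, j) that appear bonded between two strands, a hypothetical helper like this can guess the sheet's orientation:

    def sheet_orientation(pairs):
        # pairs: set of (i, j) residue-index tuples for bonded positions on two strands
        pairs = set(pairs)
        parallel = sum((i + 1, j + 1) in pairs for i, j in pairs)
        antiparallel = sum((i + 1, j - 1) in pairs for i, j in pairs)
        return "parallel" if parallel > antiparallel else "antiparallel"

    print(sheet_orientation({(10, 40), (11, 41), (12, 42)}))   # parallel
    print(sheet_orientation({(10, 42), (11, 41), (12, 40)}))   # antiparallel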

    While we're here, let's look at steric clashes. For this we will need the mutagenesis wizard. Select and zoom to the glycine at position 44, then open the wizard, click the glycine, and where it says "no mutation", select TRP, or tryptophan.

    All those red things are indicators of where two atoms are too close together; you can see that we cannot easily fit a tryptophan here. Let's try something smaller, like alanine (ALA)...and hey, no (egregious) clashes. Can we fit something larger? No. But it might be educational to try.

    Try a serine (SER), and this time look at the State counter in the lower right. Click on the forward and back buttons (the arrowheads) or the Play button (the triangular arrow) and watch that hydroxyl group spin around...and clash everywhere, but at least it's clashing in different places each time. What you're looking at are different rotamers, or rotational isomers, of serine: different sets of chi angles (or, in this case, different values for one chi angle.) They exist because not all possible chi angles are equally likely; several superimpose atoms on top of each other or are otherwise unfavorable. Most programs which edit chi angles directly do so with reference to rotamer libraries formed from large numbers of structures, and this is one of the ways in which protein folding and protein design are the same problem: if I'm designing, I can just pick a rotamer of any amino acid I want rather than whatever one is already there, bearing in mind that different amino acids are more favorable for different phi and psi angles.


    At this point score functions diverge as their purposes diverge; some attempt to integrate biological knowledge about which rotamers are more common than others, for example. Most of them include some term to penalize phi and psi angles outside the Ramachandran plot of the most commonly represented (and most physically plausible) phi/psi angle combinations by residue. The rest tend to reflect details of the implementation of the score function rather than any specific biophysical relationship.

    So now we have a sense of what we can change from one protein configuration to another and a way to evaluate two configurations relative to each other; these are sampling and scoring, respectively. All we need now is a way to govern those changes...and it is here that the commonalities end. Rather than speculate about what a particular home folding program uses (since the details are closed-source to end users anyway), I'll just list the various ways of sampling conformations, bearing in mind that most of them are modular enough that any one application might use any or all of them.

    The protein folding problem itself, and most really big problems that start with no structural data whatsoever, usually make use of fragment folding: we take the protein structures we already have, cut them up into short fragments, and change the structure of our unknown protein to conform to fragments with the same sequence. This has the advantage of being completely based on the input sequence, but it takes forever and different structures can vary wildly; switch out a fragment in the middle of your sequence and suddenly half your protein is somewhere completely different. Scoring helps eliminate the really bad ones, but there are all kinds of sampling abnormalities and problems with how coarse-grained this is that make fragment folding alone best as a first step.

    With some structural data, or the structural data of proteins similar in sequence, we can do homology modeling; this is like fragment folding except that one fragment might be most of the protein. This works best for very similar proteins, but it shares fragment folding's inability to pick out subtle differences in folds and its constraints to known folds. Of course, one can homology model and then try to fragment fold in the gaps, too.

    Going finer-grained than that requires making some decisions about time. It's entirely possible to ignore time altogether in these simulations (at least explicitly) and just assume one is modeling the equilibrium state that will persist forever until something perturbs it, making random or biased quasi-random changes to dihedrals without trying to emulate the native folding pathway. Usually this involves a Monte Carlo approach, which is common to problems where evaluating a solution is fast but there are many possible solutions. We score the thing, we make a move, and we re-score the thing. If the new score is better, we keep the new thing; if it is worse, we undo that move some of the time -- the worse the change, the more often we undo it. This is because a move that is itself unfavorable might, combined with other moves, leave us collectively better off than we started, so we want to give weird moves a chance. Of course this approach is highly random and must be run over and over again to give us the best chance of finding the right fold, but it does have the advantage of being highly tunable. In its way, Foldit involves teaching its users how to think like a Monte Carlo algorithm with a more nuanced awareness of the set of possible moves than it might otherwise be feasible to program.
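    The accept/reject logic in that paragraph is the Metropolis criterion, and a generic version of it fits in a few lines; perturb and score here are placeholders for a real move set and score function rather than anything from a specific program:

    import math, random

    def monte_carlo(start, perturb, score, steps=10000, temperature=1.0):
        current, current_score = start, score(start)
        for _ in range(steps):
            candidate = perturb(current)               # e.g. nudge one phi/psi/chi angle
            candidate_score = score(candidate)
            delta = candidate_score - current_score    # lower score = better here
            # Keep improvements always; keep worse moves only sometimes --
            # and the worse the move, the less often we keep it.
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                current, current_score = candidate, candidate_score
        return current, current_score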

    We can also have a loose, indirect awareness of kinetics and time through a Markov chain, which is essentially a graph of every fold the protein can adopt (that passes some probability threshold) with edges that are weighted by how likely the transitions are between those states. Folding@home actually explicitly mentions using these, and they have also been used in sequence alignment problems and similar cases where we can make clear distinctions between states. It is extremely parallelizable and very amenable to cleverness in doing so, as their FAQ makes clear, but it also faces the same pitfalls that can affect machine learning algorithms: errors that creep into the initialization of the graph can compound rather than cancel, so it is more sensitive to initial conditions than a less pre-computed approach.
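    To give a feel for what such a model looks like once built, here is a toy three-state version with invented transition probabilities (nothing to do with folding@home's actual models); repeatedly applying the transition matrix propagates a starting population of unfolded chains toward the model's long-run behavior:

    import numpy as np

    T = np.array([[0.90, 0.09, 0.01],    # unfolded     -> (unfolded, intermediate, folded)
                  [0.05, 0.80, 0.15],    # intermediate -> ...
                  [0.00, 0.02, 0.98]])   # folded is fairly "sticky" in this toy example

    population = np.array([1.0, 0.0, 0.0])   # start everything unfolded
    for _ in range(500):
        population = population @ T
    print(population)   # approaches the stationary distribution, mostly folded here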

    Probably the most explicitly chronological approach I know is molecular dynamics, or MD, which attempts to actually simulate, at some discrete time step, the folding of a protein by calculating the displacement of each atom as a function of the forces upon it, then moving them all at once and repeating. This, as you might expect, is so hugely math-intensive that it has only recently become more feasible to do for protein-sized molecules, and like the Markov approach above it depends on your confidence in the underlying kinetic parameters -- but even more so. It can, however, offer unprecedented insight into transient states of proteins, and it is good at representing proteins that exist as ensembles of very different states (so it can help model misfolding or high-temperature simulations.)
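    The core of an MD engine is an integrator applied over and over; a bare-bones velocity Verlet step looks like the sketch below, where forces stands in for whatever force field the simulation actually uses:

    import numpy as np

    def velocity_verlet_step(positions, velocities, masses, forces, dt):
        # Advance every atom by one time step dt; forces(positions) returns an (N, 3) array.
        accel = forces(positions) / masses[:, None]
        positions = positions + velocities * dt + 0.5 * accel * dt ** 2
        new_accel = forces(positions) / masses[:, None]
        velocities = velocities + 0.5 * (accel + new_accel) * dt
        return positions, velocities

    Everything hard about MD lives inside forces and in keeping dt small enough (femtoseconds) that the integration stays stable, which is exactly why it is so expensive.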

    So, to finally answer the question "what is happening in the visualization", it is this: a macromolecule, in this case a protein or part of one, is quasi-randomly perturbed and subsequently scored in order to weight the likelihood of it passing through the resultant configuration along its folding pathway. These data and others are then used to generate and prioritize computation of the next iteration of the map of possible folding pathways, and so on until likely pathways from unfolded chain to a folded state emerge.

    I hope that helps.

  #2
    Firbolg in the Playground
     
    Bohandas's Avatar

    Join Date
    Feb 2016

    Default Re: Half an explanation of protein folding

    Quote Originally Posted by Trekkin View Post
    RNA structure prediction is still really hard, but it should be pointed out that, like DNA, RNA contains only four kinds of nucleobase; it has the flexibility to fold into really complicated shapes, but they're still constrained primarily by base pairing. Folding RNA is as much a topological problem as a chemical one, but there are somewhat fewer rules to remember.
    Actually I think there's like half a dozen additional bases RNA can wind up with beyond the four it starts with when first transcribed
    "If you want to understand biology don't think about vibrant throbbing gels and oozes, think about information technology" -Richard Dawkins

    Omegaupdate Forum

    WoTC Forums Archive + Indexing Projext

    PostImage, a free and sensible alternative to Photobucket

    Temple+ Modding Project for Atari's Temple of Elemental Evil

    Morrus' RPG Forum (EN World v2)

  #3
    Ogre in the Playground
     
    NecromancerGuy

    Join Date
    Mar 2010

    Default Re: Half an explanation of protein folding

    Quote Originally Posted by Bohandas View Post
    Actually I think there's like half a dozen additional bases RNA can wind up with beyond the four it starts with when first transcribed
    Certainly, but I didn't want to get into the details of something entirely ancillary to my point for the same reason I didn't mention glycans. Post-translational modifications are part of the other half of the explanation of protein folding.

  #4
    Dwarf in the Playground
     
    RangerGuy

    Join Date
    May 2015
    Location
    Utopia
    Gender
    Male

    Default Re: Half an explanation of protein folding

    Thanks for answering my question!

    I'm still trying to understand since I'm lacking a lot of knowledge, I haven't studied much biology, but I could definitely study more.

  #5
    Ogre in the Playground
     
    Lacuna Caster's Avatar

    Join Date
    Oct 2014

    Default Re: Half an explanation of protein folding

    Thanks for the post- I was doing a project recently using hidden markov models and I'd heard they were being used in genetics, so it's interesting to see a weighing of the pros and cons of different approaches.

    Where would you actually get the data for setting up the transition probabilities in such a model, if I wanted to set up a viterbi trellis and run some basic simulations?

    Last edited by Lacuna Caster; 2017-06-26 at 09:05 AM.
    Give directly to the extreme poor.

  #6
    Ogre in the Playground
     
    NecromancerGuy

    Join Date
    Mar 2010

    Default Re: Half an explanation of protein folding

    Quote Originally Posted by Lacuna Caster View Post
    Thanks for the post- I was doing a project recently using hidden markov models and I'd heard they were being used in genetics, so it's interesting to see a weighing of the pros and cons of different approaches.

    Where would you actually get the data for setting up the transition probabilities in such a model, if I wanted to set up a viterbi trellis and run some basic simulations?
    No problem! Glad it helped.

    Now, as to your question: the data set that informs your transition probabilities depends a lot on what your states are, as you might expect. I'll try to give an overview, but in all cases the simulations you'd run would depend heavily on the problem you want to solve.

    Generally speaking, you can get transition probabilities for structural Markov models the same way you get them for sequence-based Markov models: by mining large sets of data on native molecules.

    You probably know how one can most simply build a hidden Markov model out of a set of sequences: pick a reference sequence, write out a set of states corresponding to insertions, deletions, and neither for each position in that sequence, write out the emission probability for each state, and run every sequence in the set through the model, incrementing probabilities from some nonzero minimum as you go. What you have at the end can evaluate the odds of a particular sequence of mutation events being behind a given sequence. In theory, such a model could be naively parameterized from aggregate data on global transition/transversion/indel probabilities, although I'm struggling to come up with a compelling reason to do that to solve the kinds of phylogenetic problems we usually use HMMs for. Regardless, you can see how one could do the same using a PAM or BLOSUM matrix when dealing with amino acid sequences. Doing that manually was an excruciatingly boring part of my graduate class in sequence analysis, by the by.
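    As a very rough sketch of that counting step: given state paths through the model for each aligned sequence, you tally transitions with a pseudocount and normalize.

    from collections import defaultdict

    def transition_probabilities(state_paths, pseudocount=1.0):
        # state_paths: lists of state labels per sequence, e.g. ["M1", "M2", "I2", "M3", ...]
        counts = defaultdict(lambda: defaultdict(lambda: pseudocount))
        for path in state_paths:
            for a, b in zip(path, path[1:]):
                counts[a][b] += 1.0
        return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
                for a, nxt in counts.items()}

    (Note the pseudocount here only softens observed transitions; a real profile HMM would also reserve probability for transitions never seen in the training set.)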

    Anyway, for structural data we have a richer dataset in the form of crystal structures and nuclear magnetic resonance data. If you're trying to do something like packing side chains, for example, you might build an HMM where the state sets are residues and the transition probabilities are drawn from some backbone-sensitive rotamer library; full-on de novo folding is (much much MUCH) harder, but could potentially start with the secondary structure propensities of different amino acid residues. If you wanted to do something more naive, you could potentially pull phi/psi/chi angle probabilities from the crystals directly, but you'd almost certainly want to bin those based on something else since they're so context-dependent. Maybe not for proline.

    Regardless, you could build a structural HMM for folding where the transitions reflect the probability that a given residue's dihedrals are suboptimal and the emission arrays give you the actual dihedrals, the latter of which you could get from a Ramachandran plot or rotamer library. The former you could get from fragment libraries, I suppose, and then modify based on local connectivity to encourage the refolding of more isolated residues, ceteris paribus. Naturally you'd only build one state set at a time, since these would be huge.

    Do note that this isn't how folding@home works specifically; there are more neural-network processes in there than this naive approach would suggest. This is how I would approach trying to build an explicitly HMM-based folding program (as opposed to more generic Monte Carlo software) to fold things a priori quickly enough to run in reasonable time on hardware a private individual could easily put together, without needing to get specific about the exact sequence. The output will likely not be great; this is, as you say, "basic." You'd want to score the outputs and refine the transition probabilities accordingly to bias it toward emitting more folded structures.

  #7
    Ogre in the Playground
     
    Lacuna Caster's Avatar

    Join Date
    Oct 2014

    Default Re: Half an explanation of protein folding

    Hmm. I suspect that properly understanding that answer is still a year or two of study beyond me, though I found some notes on the BLOSUM62 matrix here, which can apparently be matched up with amino acid initials here? But as you say, that simply tells me about the odds of a raw sequence being modified in a quasi-random fashion, not how it would fold.

    There are some notes on generating Ramachandran plots in python here, which is nice, but the reference data set seems to be concerned with complex proteins rather than the bases themselves. Is there a particular fragment library you'd recommend I look at, if I wanted to extract the corresponding PDB files?
    Give directly to the extreme poor.

  #8
    Ogre in the Playground
     
    NecromancerGuy

    Join Date
    Mar 2010

    Default Re: Half an explanation of protein folding

    Quote Originally Posted by Lacuna Caster View Post
    Hmm. I suspect that properly understanding that answer is still a year or two of study beyond me, though I found some notes on the BLOSUM62 matrix here, which can apparently be matched up with amino acid initials here? But as you say, that simply tells me about the odds of a raw sequence being modified in a quasi-random fashion, not how it would fold.

    There are some notes on generating Ramachandran plots in python here, which is nice, but the reference data set seems to be concerned with complex proteins rather than the bases themselves. Is there a particular fragment library you'd recommend I look at, if I wanted to extract the corresponding PDB files?
    Yeah; I mentioned it by way of analogy with DNA sequence alignment HMMs, but all a BLOSUM or PAM matrix can tell you is relative odds of a mutation changing one specific amino acid to another, averaged across many proteins, with certain underlying assumptions about how much the sequences have mutated (since, as you can imagine, unfavorable mutations get relatively more common the more mutation you allow; even something extreme like a gly->trp is possible if other mutations have created a void big enough for a tryptophan to fit.) It knows nothing of structure.

    Fragment libraries aren't really a fixed thing like rotamer libraries; there's not necessarily that much statistical metadata to them outside of the coordinates and sequences themselves, so I've always just made my own for whatever purpose I need from collections of PDBs. Top500 is as good a place to start as any if there isn't a specific problem you want to solve, although one of the bigger topX libraries will give you better sequence coverage. Just figure out an efficient way to look up structural references by sequence and randomly choose a fragment corresponding to your unknown sequence for each window of whatever length you choose.

    Your real problem is going to be scoring, since fragments are inherently local and threading onto them can easily cause distant parts of your protein to clash with each other. You could do some kind of coarse-grained check to make sure all the inter-C-alpha distances are above some cutoff, though, which would find a lot of the more egregious errors.
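    That check is only a few lines; here's a naive, purely illustrative version, assuming you have the C-alpha coordinates in residue order:

    import numpy as np

    def calpha_clashes(coords, cutoff=4.0):
        # coords: (N, 3) array of C-alpha positions; sequential neighbors sit ~3.8 A apart,
        # so only non-adjacent pairs closer than the cutoff are flagged as clashes.
        clashes = []
        for i in range(len(coords)):
            for j in range(i + 2, len(coords)):
                if np.linalg.norm(coords[i] - coords[j]) < cutoff:
                    clashes.append((i, j))
        return clashes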

    As for what all this has to do with HMMs...it doesn't. This is the naive approach to fragment insertion -- the one HMMs can optimize over many nodes, though below a certain problem size they're more overhead than they're worth. It's also reinventing the wheel in a major way, which makes for a good programming exercise.
