Proteins are the building blocks of all life. Each one folds from a sequence of amino acids into a highly specific three-dimensional structure, and that structure determines its function. Designed proteins have applications ranging from industrial uses like food storage to novel drug therapies. Until now, designing from scratch a sequence that folds into a desired structure has been a challenge.

A research group including U of T’s Philip M. Kim, a professor at the Terrence Donnelly Centre for Cellular and Biomolecular Research and the Departments of Molecular Genetics and Computer Science, has recently made advances in solving this problem. 

Rather than testing all possible sequences, the group members start from an existing three-dimensional structure and try to predict possible amino acid sequences using machine learning techniques. The results were published in the journal Cell Systems.

From sudoku to protein design

One of the difficulties of designing proteins is the vast number of possible sequences. There are 20 amino acids for living organisms to choose from, so the number of possible protein sequences that are 200 amino acids long, 20 to the power of 200, far exceeds the number of atoms in the galaxy.
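To get a feel for the scale, the count can be worked out directly. The short Python sketch below is only an illustration; the figure of roughly 10^80 atoms is a commonly quoted rough estimate for the entire observable universe, used here as an assumption for comparison.

```python
import math

NUM_AMINO_ACIDS = 20
SEQUENCE_LENGTH = 200

# 20 choices at each of 200 positions.
num_sequences = NUM_AMINO_ACIDS ** SEQUENCE_LENGTH

# Order of magnitude: the x in 10^x.
log10_sequences = SEQUENCE_LENGTH * math.log10(NUM_AMINO_ACIDS)

# Assumption: rough, commonly quoted estimate for the observable universe.
LOG10_ATOMS_IN_UNIVERSE = 80

print(f"roughly 10^{log10_sequences:.0f} possible sequences")
print(log10_sequences > LOG10_ATOMS_IN_UNIVERSE)
```

Even against that much larger yardstick, the number of sequences is greater by about 180 orders of magnitude, which is why testing every sequence is not an option.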

The researchers began by thinking of protein design as a type of constraint satisfaction problem (CSP). “In simple terms, [CSPs] are puzzles where we are given some variables and our goal is to assign one out of a predetermined set of values to each of those variables while obeying certain conditions,” wrote Alexey Strokach, lead author of the paper and PhD student, in an email to The Varsity.

“One commonly-used example of a CSP is the map colouring problem, where we are asked to assign a colour to every country on a map such that no two neighbouring countries have the same colour.”
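The map colouring puzzle Strokach describes can be sketched in a few lines of Python. This is a textbook backtracking search, not the researchers' method, and the four-region map and the function name `colour_map` are hypothetical, made up for illustration.

```python
def colour_map(neighbours, colours):
    """Backtracking search: return {region: colour}, or None if impossible."""
    assignment = {}
    regions = list(neighbours)

    def solve(i):
        if i == len(regions):
            return True  # every region has a colour
        region = regions[i]
        for colour in colours:
            # Constraint: no neighbour may already have this colour.
            if all(assignment.get(n) != colour for n in neighbours[region]):
                assignment[region] = colour
                if solve(i + 1):
                    return True
                del assignment[region]  # undo and try the next colour
        return False

    return assignment if solve(0) else None


# Hypothetical toy map: each region lists the regions it borders.
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
solution = colour_map(neighbours, ["red", "green", "blue"])
print(solution)
```

Here the variables are the regions, the values are the colours and the condition is that neighbours differ. With only two colours the same function returns `None`, because regions A, B and C all border one another.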

The trick was to design an algorithm that could solve a well-established CSP, before extending it to protein design. In this case, the researchers chose sudoku as their initial target, which has the advantage of having a definite solution. “It is much easier to generate the training data and to train and evaluate models to solve a toy CSP, such as Sudoku, than to train a model for protein design,” Strokach wrote.
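Sudoku fits the same mould: the variables are the empty cells, the values are the digits one to nine, and the conditions are the row, column and box rules. The Python sketch below only illustrates those conditions; it is not the researchers' neural network, and the function name `candidates` and the formula used to fill the example grid are assumptions chosen for illustration.

```python
def candidates(grid, row, col):
    """Digits that can go in cell (row, col) without breaking a sudoku rule.

    `grid` is a 9x9 list of lists, with 0 marking an empty cell.
    """
    used = set(grid[row])                       # same row
    used |= {grid[r][col] for r in range(9)}    # same column
    top, left = 3 * (row // 3), 3 * (col // 3)  # top-left of the 3x3 box
    used |= {grid[r][c]
             for r in range(top, top + 3)
             for c in range(left, left + 3)}
    return set(range(1, 10)) - used


# A valid completed grid built from a simple shifting pattern.
grid = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
        for r in range(9)]

removed = grid[4][4]
grid[4][4] = 0                  # blank out one cell
print(candidates(grid, 4, 4))   # only the removed digit fits
```

Because any completed grid can be checked against these rules mechanically, it is straightforward to produce puzzles with known answers for training and evaluation.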

The constructed algorithm was able to correctly solve up to 72 per cent of a given sudoku puzzle on a first try, with accuracy going up to 90 per cent after multiple attempts. The next challenge was to teach the algorithm to “play” the equivalent of sudoku on millions of protein sequences, so that it could design new sequences from scratch.

Using neural networks

The final algorithm, called ProteinSolver, grew out of an understanding that protein design could be modelled as a graph — points or nodes connected by lines — from which patterns can be inferred by a machine learning algorithm called a neural network. 

With the amino acids in a protein structure standing in for nodes and the distances between amino acids represented as the lines connecting them, ProteinSolver was able to correctly replicate known protein sequences. Eventually, it was able to generate novel sequences for given protein structures.
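As a rough picture of what such a graph looks like as data, the Python sketch below turns a list of amino acids and their positions into nodes and lines. It is not ProteinSolver's code: the function name `build_residue_graph`, the made-up coordinates and the 12-angstrom distance cutoff are all assumptions for illustration.

```python
import math


def build_residue_graph(sequence, coords, cutoff=12.0):
    """Return (nodes, edges) for a protein structure.

    nodes: one amino acid letter per position.
    edges: (i, j, distance) for each pair of positions closer than `cutoff`.
    The cutoff value is an illustrative assumption.
    """
    nodes = list(sequence)
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            distance = math.dist(coords[i], coords[j])
            if distance <= cutoff:
                edges.append((i, j, distance))
    return nodes, edges


# Hypothetical three-residue fragment with made-up coordinates (in angstroms).
sequence = "MKV"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]

nodes, edges = build_residue_graph(sequence, coords)
print(nodes)
print(edges)
```

In a design setting, the lines and distances would be known from the target structure while the letters on the nodes are the unknowns to be filled in, much like the empty cells of a sudoku grid.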

To test whether the predicted sequences work in reality, Albert Perez-Riba, a postdoctoral fellow at the Donnelly Centre, expressed and purified the designed proteins in the lab. Perez-Riba then probed their secondary structure through experimental methods and found that it matched the algorithm's predictions.

Although the protein sequences predicted in this paper are not believed to have practical uses themselves, the algorithm has numerous potential applications beyond this initial publication.

According to Strokach, these include generating variants of known proteins for industrial use. “In the pharmaceutical industry, this could be used to improve the shelf-life of biologics, while in industrial process engineering, this could be used to increase the activity of biocatalysts in inhospitable conditions,” he wrote.

The researchers have made both their code and program available for public use.