The genomics revolution has resulted in the sequencing of the entire human genome, which has produced vast amounts of data. This data holds huge potential for understanding, among other things, the genetic basis for disease. However, genetic data is inherently complex and challenging to interpret.

PhD students Hui Yuan Xiong and Hannes Bretschneider and postdoctoral fellow Babak Alipanahi from U of T’s Probabilistic and Statistical Inference Group, along with more than 10 other researchers, worked to build a computer model that can predict the effects that mutations in the genome would have on splicing, the malfunction of which often leads to disease. Splicing is a critical biological process in higher organisms.

The lab is led by Dr. Brendan Frey, a professor in the Department of Electrical and Computer Engineering, the Banting and Best Department of Medical Research, and the Department of Computer Science. Frey has been working on this project for more than 10 years.

“For every mutation, we can predict whether it’s going to disrupt splicing,” explains Xiong in an interview over Skype.

Most methods for studying genetic disorders examine only the exons, the parts of the genome that encode protein. This is because scientists have known for a long time how genes encode proteins, making it relatively easy to understand how a genetic mutation causes a change in protein structure.

“When something goes wrong there, it’s very easy to predict what the result is. And so that’s kind of the low-hanging fruit that has been done before,” says Bretschneider. 

A large part of our genetic blueprint lies in the introns, the stretches within genes that don’t encode protein. Introns play a role in determining, among other things, where genes are spliced prior to being expressed as protein.

“So when you have a mutation [in the introns], if it has an effect, then it works through disturbing the regulation of that gene, and that can be really important too. And that’s a type of mutation that we can analyze that you can’t analyze when you only look at the exons,” says Bretschneider.

Many prominent genetic diseases are associated with splicing errors, including spinal muscular atrophy, autism, and some heritable forms of colorectal cancer.

Understanding how the genome signals for splicing is an enormous challenge because it involves complex interactions between DNA splicing motifs that are not simple to model. The team’s tool, called SPANR (short for Splicing-based Analysis of Variants), applies a technique called machine learning to try to solve this problem.

“Machine learning is a field of artificial intelligence that deals with pattern recognition. So, the applications that you read a lot about these days, is image recognition and speech recognition. It’s like the technology that’s in every phone nowadays. So we’re basically using the same kind of technology and we’re applying it to genetics problems,” says Bretschneider.

Machine learning is also at the heart of other artificial intelligence systems, like IBM’s Jeopardy-playing Watson computer and ROSS, an artificially intelligent lawyer app recently developed by students at U of T.

When asked how machine learning works, Xiong explained that there are two key steps. The first is to collect “a data training set that consists of features and the targets that you want to predict,” and the second is to learn “a function that maps from the features to the targets so that the function can be used on future data that you have never seen before.” Once both steps are complete, the model can make predictions on new data.
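For readers who want to see the two steps in miniature, here is a rough sketch in Python. It is not SPANR's code: the numbers are invented, the "function" is just a straight line fitted by ordinary least squares, and the names `fit_line` and `predict` are hypothetical helpers chosen for illustration.

```python
# Toy illustration of the two steps Xiong describes.
# Assumption: made-up numbers and a straight-line model, nothing genomic.

def fit_line(features, targets):
    """Step 2: learn a function (here, a line) mapping features to targets."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    var = sum((x - mean_x) ** 2 for x in features)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned function to a feature value."""
    slope, intercept = model
    return slope * x + intercept

# Step 1: collect a training set of features and the targets to predict.
train_features = [1.0, 2.0, 3.0, 4.0]
train_targets = [3.0, 5.0, 7.0, 9.0]  # these happen to follow y = 2x + 1

# Step 2: learn the mapping from features to targets.
model = fit_line(train_features, train_targets)

# Use the learned function on a value that was not in the training set.
new_prediction = predict(model, 10.0)
print(new_prediction)
```

A real system swaps the straight line for a far more flexible function and the four data points for a very large training set, but the collect-then-learn pattern is the same.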

In this case, SPANR was trained on healthy genomes and data on how their genes are normally spliced. A computer algorithm uses this data to build a model, which can then predict how likely a specific mutation is to alter splicing in new scenarios.

“Because it captures something about the biological mechanism of splicing, it can be used to predict mutations as well — so the effects of mutations, even though the model has never seen mutations during training,” says Xiong.
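One way to picture how a model trained only on normal sequences could still score a mutation is to run it on both the original and the altered sequence and compare the two outputs. The sketch below illustrates that idea only. It is an assumption, not a description of SPANR's internals: `toy_splicing_score` is a hypothetical stand-in that simply counts an arbitrary three-letter pattern, and the sequences are invented.

```python
import math

def toy_splicing_score(sequence):
    """Hypothetical stand-in for a trained model: DNA string -> value in (0, 1).

    Assumption: it just counts occurrences of the arbitrary pattern "GGG".
    """
    motif_count = sum(
        1 for i in range(len(sequence) - 2) if sequence[i:i + 3] == "GGG"
    )
    return 1.0 / (1.0 + math.exp(-(motif_count - 1)))

def apply_substitution(sequence, position, new_base):
    """Return a copy of the sequence with one base replaced (0-based position)."""
    return sequence[:position] + new_base + sequence[position + 1:]

reference = "ACGGGTTAGGGCA"                      # invented sequence
mutated = apply_substitution(reference, 3, "T")  # disrupts the first "GGG"

# The predicted effect is the change in model output caused by the mutation.
effect = toy_splicing_score(mutated) - toy_splicing_score(reference)
print(mutated, effect)
```

The scoring function here never sees a mutation while it is being defined; the mutation only enters when the two outputs are compared.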

The results of the study were published in Science last month. SPANR has already seen many successes, including correctly predicting splice sites from genomes it has never seen before.

“We tested on many different data sets… One thing we tried is that we just held out a lot of data from the healthy human genome, trained the model without looking at this held-out data, and then we look at the performance of our model on the held-out data. And it works pretty well,” Xiong notes.
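The held-out test Xiong describes follows a standard pattern: set some data aside, train without it, then measure error on the part the model never saw. The Python sketch below shows that pattern on synthetic numbers. The data, the 10-of-50 split, and the helper names `fit_line` and `mean_squared_error` are all assumptions for illustration and are not taken from the study.

```python
import random

def fit_line(xs, ys):
    """Fit a straight line by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, xs, ys):
    """Average squared gap between predictions and true targets."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption): a noisy straight line.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]

# Hold out 10 examples; the model never sees them during training.
indices = list(range(len(xs)))
rng.shuffle(indices)
held_out = indices[:10]
train = indices[10:]

model = fit_line([xs[i] for i in train], [ys[i] for i in train])

# Performance is measured only on the held-out examples.
held_out_error = mean_squared_error(
    model, [xs[i] for i in held_out], [ys[i] for i in held_out]
)
print(held_out_error)
```

A low error on the held-out portion suggests the model has learned something general rather than memorizing its training examples.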

Among other achievements, SPANR accurately predicted 94 per cent of the genes that are already linked to well-studied diseases.

The model is already finding applications in the field. It has discovered many new mutations that could cause various diseases, including 39 new mutations that could be linked to autism.

SPANR produces useful information from the vast amount of data contained in a genome. This new approach to genetics could open the door for personalized medical treatments that are tailored to individuals based on their genes.

According to Bretschneider, there is a lot of potential for the use of SPANR in personalized medicine. “One possibility is that, for example, for people [with] genetic diseases, you might be able to use this technique to design custom drugs for them that bind to that specific location on their RNA where they have a problem, and then change the regulation of that,” he says, adding, “When you look at a cancer patient’s genome, you might be able to predict that a certain drug will work better for that person than another drug.”