Proteins play a significant role in molecular biology. The amino acid sequences encoded by DNA make up proteins, which later fold into various three-dimensional structures that allow them to execute a myriad of functions, such as metabolizing the food we consume, triggering immune responses, and activating genes to synthesize more proteins.
The natural diversity of proteins can be credited to evolution: their function and structure have been painstakingly optimized through natural selection. However, 2023 has brought a new player to protein generation: artificial intelligence (AI). Diffusion models, the kind of sophisticated AI behind tools such as DALL-E, which generates realistic images from a textual description, are now being used to design new proteins.
Researchers at the Kim Lab at U of T have developed ProteinSGM, a score-based generative model (SGM), which can be used to produce new proteins. The Varsity talked to Michael Lee, who designed ProteinSGM, and his supervisor Philip Kim, a professor at U of T’s Donnelly Centre for Cellular and Biomolecular Research. Kim explained how Lee devised a way to design protein structures similar to the way AI generates images.
How ProteinSGM works
In an SGM, the ‘score’ is a mathematical quantity, the gradient of the log-probability of the data, that tells the model in which direction to nudge a noisy sample so that it looks more like the training data, the real-life examples the SGM is supposed to emulate. By learning to estimate these scores, the model can produce samples that match the characteristics of the original training data. In the case of ProteinSGM, the training data are image-like representations of protein structures fed to the system.
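To make the idea of an image-like representation of a protein structure concrete, here is a minimal sketch in Python. It assumes one common choice, a pairwise distance matrix between residues; the article does not say exactly which representation ProteinSGM uses, and the four-point `coords` array is a made-up toy backbone, not real protein data.

```python
import numpy as np

def distance_matrix(coords):
    """Return the (n, n) matrix of Euclidean distances between n 3-D points."""
    # Broadcasting builds every pairwise difference vector at once.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical 4-residue backbone trace: one 3-D point per residue, arbitrary units.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 4.0, 0.0],
])

dmap = distance_matrix(coords)
print(dmap)
```

The result is a square, symmetric grid of numbers with zeros on the diagonal, which can be treated much like a single-channel image by an image-generation model.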
Prior to SGMs, protein engineering was slow, a trickle compared with the rapid progress SGMs have enabled in the last five years. The Kim Lab now designs proteins with the help of these recent innovations from the machine learning community.
Kim is motivated by the bigger picture: taking therapeutic antibodies designed by AI from theoretical models to real clinical applications. Still, he admits that the basic scientist in him wants to zoom in on the fundamentals and, by modelling protein structures for further study, understand more about which structures nature permits.
Lee wanted to explore generative biology, using AI-powered computational models to design molecules. Upon starting his PhD, he turned to the developing field of AI diffusion models, which generate new data similar to the data they were trained on. He realized the AI models used for image generation could also be applied to protein design since they operate on the same principle: ‘corrupting’ training data with noise and learning how to recreate data similar to the original training data by fixing the applied corruption. By the end of this training, the AI is capable of generating new structures similar to the ones it has been fed.
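The corrupt-then-repair principle can be illustrated with a toy sketch in Python. This is not ProteinSGM's method: the training data here are plain numbers rather than protein structures, and a straight-line fit stands in for the neural network a real diffusion model would train. The noise level `sigma` and the helper names `corrupt` and `fit_linear_denoiser` are assumptions made for illustration.

```python
import numpy as np

def corrupt(x0, sigma, rng):
    """Forward step: add Gaussian noise with standard deviation sigma."""
    return x0 + sigma * rng.standard_normal(x0.shape)

def fit_linear_denoiser(x_noisy, x_clean):
    """Least-squares fit of x_clean ~ a * x_noisy + b (stand-in for a neural network)."""
    a, b = np.polyfit(x_noisy, x_clean, deg=1)
    return a, b

rng = np.random.default_rng(0)

# Toy "training data": 20,000 numbers centred on 5 with spread 1.
x0 = rng.normal(loc=5.0, scale=1.0, size=20_000)

# Corrupt the data with noise, then learn how to undo the corruption.
sigma = 2.0
x_noisy = corrupt(x0, sigma, rng)
a, b = fit_linear_denoiser(x_noisy, x0)
x_denoised = a * x_noisy + b

# Compare how far the noisy and the repaired samples are from the originals.
mse_noisy = np.mean((x_noisy - x0) ** 2)
mse_denoised = np.mean((x_denoised - x0) ** 2)
print(f"error before denoising: {mse_noisy:.2f}, after: {mse_denoised:.2f}")
```

For data this simple, the best possible denoiser happens to be a straight line, which is why a linear fit suffices here. Real diffusion models repeat the same idea over many noise levels and then run the learned repair step starting from pure noise to produce new samples.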
The significance of ProteinSGM
Models like ProteinSGM seek to design completely novel proteins with specific binding targets — targeted molecules that cause reactions when they interact with proteins. Lee said that AI-designed proteins structurally resemble their natural counterparts in almost every way, which is ideal for functional purposes. Additionally, AI-generated amino acid sequences don’t require the selective pressures imposed by natural selection to develop into functional structures, which means ProteinSGM can form more diverse structures.
In nature, the white blood cells of the human immune system produce specific antibodies: proteins that attack and neutralize invading pathogens and foreign substances. White blood cells undergo a variety of mutations to generate a diverse range of antibodies, and those with high affinities for particular foreign substances are selected for proliferation.
Antibodies have special structures that are important for binding to foreign substances. These structures have been notoriously hard for AI to model, but Kim and Lee hope that ProteinSGM might help solve this problem. This would allow researchers to explore designing and modelling therapeutic antibodies computationally, bypassing the time-consuming conventional method of using animals’ immune systems to raise antibodies and later harvest them.
Antibody-based therapies can be used to treat a wide range of conditions, from cancer to autoimmune and infectious diseases, but their effectiveness hinges on binding dynamics and sequence diversity. Therein lies the niche ProteinSGM hopes to occupy: experimental validation reveals that diffusion models like ProteinSGM currently have a success rate in designing functional antibodies that’s 50 times higher than previous computational and screening methods — a success rate that Lee hopes will only improve with time.
New protein design through AI diffusion models is cutting-edge research with the potential to generate novel antibodies. The Kim Lab studies how to create new therapeutic proteins and how AI can be applied to protein development.
Kim said the unique concentration of research in Toronto was conducive to innovation. “Toronto has been quite a great location,” he concluded. “And at U of T, we have the lucky situation of getting many talented students.”