Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only could it help with the difficult task of altering proteins in practically useful ways, but it also works by fully interpretable methods – an advantage over the conventional artificial intelligence (AI) that has aided protein engineering in the past.
The new tool, called LANTERN, could prove useful in work ranging from producing biofuels to improving crops to developing new treatments for disease. Proteins, as the building blocks of biology, are a key element in all these tasks. But while it is relatively easy to make changes to the DNA strand that serves as the template for a given protein, it remains difficult to determine which specific base pairs – the rungs of the DNA ladder – are the keys to producing a desired effect. Finding these keys has been the job of AI built from deep neural networks (DNNs), which, while effective, are notoriously opaque to human understanding.
Described in a new article published in the Proceedings of the National Academy of Sciences, LANTERN demonstrates the ability to predict the genetic changes needed to create useful differences in three different proteins. One is the spike-shaped protein from the surface of the SARS-CoV-2 virus that causes COVID-19; understanding how changes in DNA can alter this spike protein could help epidemiologists predict the future of the pandemic. The other two are well-known workhorses in the laboratory: the LacI protein of the bacterium E. coli and the green fluorescent protein (GFP) used as a marker in biology experiments. Selecting these three subjects allowed the NIST team to show not only that their tool works, but also that its results are interpretable – an important feature for industry, which needs predictive methods that help it understand the underlying system.
“We have an approach that is fully interpretable and also has no loss of predictive power,” said Peter Tonner, a statistician and computational biologist at NIST and lead developer of LANTERN. “There’s a common assumption that if you want one of these things, you can’t have the other. We’ve shown that sometimes you can have both.”
The problem the NIST team is tackling can be imagined as interacting with a complex machine that sports a vast control panel filled with thousands of unlabeled switches: the machine is a gene, a strand of DNA that encodes a protein; the switches are base pairs on the strand. The switches all affect the machine’s output in some way. If your job is to make the machine work differently in a specific way, which switches do you need to flip?
Since the answer may require changes to several base pairs, scientists must flip one combination of them, measure the result, then choose a new combination and measure again. The number of possible combinations is daunting.
“The number of potential combinations may be greater than the number of atoms in the universe,” Tonner said. “You could never measure all the possibilities. That’s a ridiculously high number.”
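A rough back-of-the-envelope calculation shows how quickly the numbers grow. The Python sketch below is only illustrative: it assumes that every position on a DNA strand can independently hold any of the four bases, and it uses 10^80 as a commonly cited order-of-magnitude estimate for the number of atoms in the observable universe. Neither figure comes from the NIST study, and the function names are made up for this example.

```python
# Illustrative arithmetic only; the constants below are assumptions,
# not figures taken from the LANTERN study.
ATOMS_IN_UNIVERSE = 10**80  # commonly cited rough estimate
BASES = 4                   # A, C, G, T


def sequence_count(length: int) -> int:
    """Number of distinct DNA sequences with the given number of positions."""
    return BASES ** length


def smallest_length_exceeding(limit: int) -> int:
    """Shortest strand whose sequence count is larger than `limit`."""
    length = 0
    while sequence_count(length) <= limit:
        length += 1
    return length


def order_of_magnitude(n: int) -> int:
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(n)) - 1


if __name__ == "__main__":
    print(smallest_length_exceeding(ATOMS_IN_UNIVERSE))
    print(order_of_magnitude(sequence_count(1000)))
```

Under these assumptions, a stretch of only 133 base pairs already allows more distinct sequences than the atom estimate, and a strand of 1,000 base pairs allows roughly 10^602.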
Because of the amount of data involved, DNNs have been tasked with sorting through a sample of the data and predicting which base pairs should be flipped. At this they have proved successful – as long as you don’t ask for an explanation of how they get their answers. They are often described as “black boxes” because their inner workings are inscrutable.
“It’s really hard to understand how DNNs make their predictions,” said NIST physicist David Ross, one of the paper’s co-authors. “And that’s a big deal if you want to use those predictions to design something new.”
LANTERN, on the other hand, is explicitly designed to be understandable. Part of its explainability stems from its use of interpretable parameters to represent the data it analyzes. Rather than allowing the number of these parameters to grow extraordinarily large and often inscrutable, as is the case with DNNs, each parameter in LANTERN’s calculations serves a purpose that is intended to be intuitive, helping users understand what these parameters mean and how they influence LANTERN’s predictions.
The LANTERN model represents protein mutations using vectors, widely used mathematical tools often represented visually as arrows. Each arrow has two properties: its direction indicates the effect of the mutation, while its length represents the strength of that effect. When two proteins have vectors that point in the same direction, LANTERN indicates that the proteins have a similar function.
The directions of these vectors often correspond to biological mechanisms. For example, LANTERN learned a direction associated with protein folding in all three datasets studied by the team. (Folding plays a critical role in how a protein works, so identifying this factor in the datasets was an indication that the model is working as expected.) When making predictions, LANTERN simply adds these vectors together – a method that users can follow when reviewing its predictions.
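The additive step described above can be sketched in a few lines of Python. This is a toy illustration, not LANTERN's code: the mutation names, the two-dimensional vectors and the helper functions are all hypothetical, and using cosine similarity to check whether two arrows point the same way is a choice made for this example rather than something the article attributes to the NIST team.

```python
import math

# Hypothetical mutation vectors in a 2-D space (values invented for illustration).
MUTATION_VECTORS = {
    "M1": (0.9, 0.1),
    "M2": (0.45, 0.05),  # same direction as M1, half the length
    "M3": (-0.1, 0.6),   # a mostly different direction
}


def combine(mutations):
    """Add the vectors of the listed mutations, component by component."""
    dims = len(next(iter(MUTATION_VECTORS.values())))
    total = [0.0] * dims
    for name in mutations:
        for i, component in enumerate(MUTATION_VECTORS[name]):
            total[i] += component
    return tuple(total)


def strength(vector):
    """Length of the arrow, read here as the size of the effect."""
    return math.hypot(*vector)


def cosine_similarity(a, b):
    """1.0 when two arrows point the same way, near 0 when they are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (strength(a) * strength(b))


if __name__ == "__main__":
    print(combine(["M1", "M3"]))
    print(cosine_similarity(MUTATION_VECTORS["M1"], MUTATION_VECTORS["M2"]))
```

In this toy data, M1 and M2 point in the same direction and differ only in length, so they would be read as the same kind of effect at different strengths, while a variant carrying both M1 and M3 is placed at the sum of the two arrows.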
Other labs had previously used DNNs to predict which switch flips would make useful changes to the three proteins in question, so the NIST team decided to compare LANTERN against the DNNs’ results. The new approach was not merely good enough; according to the team, it achieves a new state of the art in predictive accuracy for this type of problem.
“LANTERN matched or surpassed nearly all alternative approaches in predictive accuracy,” Tonner said. “It outperforms all other approaches in predicting changes in LacI, and it has comparable predictive accuracy for GFP for all but one. For SARS-CoV-2, it has higher predictive accuracy than all alternatives other than a type of DNN, which matched LANTERN’s accuracy but didn’t beat it.”
LANTERN determines which sets of switches have the greatest effect on a given attribute of the protein – its folding stability, for example – and summarizes how the user can modify that attribute to achieve the desired effect. In a way, LANTERN transforms the many switches on our machine’s panel into a few simple dials.
“It reduces thousands of switches to maybe five little dials that you can turn,” Ross said. “This tells you that the first dial will have a large effect, the second will have a different but smaller effect, the third even smaller, and so on. So as an engineer, this tells me that I can focus on the first and second dial to get the result I need. LANTERN explains all this to me, and it’s incredibly helpful.”
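The idea of ranking dials by how much they matter can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the vectors are invented, and using the spread (variance) of mutation effects along each dimension as a stand-in for a dial's influence is a simplification, not a description of how LANTERN itself orders its dimensions.

```python
from statistics import pvariance

# Invented mutation vectors; each column plays the role of one "dial".
HYPOTHETICAL_VECTORS = [
    (0.30, 1.2, 0.02),
    (0.10, -0.8, -0.01),
    (-0.40, 0.5, 0.03),
    (0.20, -1.0, 0.00),
]


def rank_dials(vectors):
    """Return (dial index, variance) pairs, largest spread first."""
    columns = list(zip(*vectors))  # regroup the values dial by dial
    variances = [pvariance(column) for column in columns]
    return sorted(enumerate(variances), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for index, variance in rank_dials(HYPOTHETICAL_VECTORS):
        print(f"dial {index}: variance {variance:.4f}")
```

With this toy data the second column (index 1) varies the most, the first column comes next and the third barely moves, which mirrors the picture of one large dial followed by progressively smaller ones.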
Rajmonda Caceres, a scientist at MIT Lincoln Laboratory who is familiar with the method behind LANTERN, said she valued the tool’s interpretability.
“There are not many AI methods applied to biology that are explicitly designed for interpretability,” said Caceres, who is not affiliated with the NIST study. “When biologists see the results, they can see which mutation is contributing to the change in the protein. This level of interpretation allows for more interdisciplinary research, because biologists can understand how the algorithm is learning and they can generate further insights about the biological system under study.”
Tonner said that while he’s happy with the results, LANTERN isn’t a panacea for the AI explainability problem. Exploring alternatives to DNNs more broadly would benefit overall efforts to create explainable and trustworthy AI, he said.
“In the context of predicting genetic effects on protein function, LANTERN is the first example of something that rivals DNNs in predictive power while still being fully interpretable,” Tonner said. “It provides a specific solution to a specific problem. We hope that it can be applied to others and that this work will inspire the development of new interpretable approaches. We don’t want predictive AI to remain a black box.”
Journal
Proceedings of the National Academy of Sciences
Article title
Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power
Article publication date
June 22, 2022