The AlphaFold2 success: It took a village
By: Aled Edwards, Chief Executive
Today, artificial intelligence is about more than winning games of Go. A few days ago, the scientific world was abuzz with the news that an AI computer program called AlphaFold2 was able to predict the shape of a protein given just its genetic code. This amazing accomplishment constituted a giant step toward solving one of the “big” problems in science and brings us closer to the day when we can design new proteins to tackle disease, climate change and hunger.
This amazing result was heralded as a win for machine learning, but as with all scientific accomplishments, the real story is more complex. It reveals much about how science is done and about the centrality of open science to progress.
The story started in the 1950s, when Perutz and Kendrew in England grew crystals of a muscle protein called myoglobin and used X-rays to determine its shape, a discovery that took nearly 20 years from beginning to end and was recognized with the Nobel Prize in 1962. How profound it must have been for them to be the first to “see” the molecules of life.
Their methodological breakthrough led dozens of other scientists to apply the same approach to determine the shapes of other important proteins. Those individual discoveries, including one in which I was lucky enough to play a minor role, have been recognized with 13 more Nobel Prizes.
But even 50 years ago, while they were still determining the first structures, the community of scientists in this field, known as structural biology, had a bigger dream: that one day they would be able to skip the time-consuming experiments and predict the structure of a protein simply from its genetic code using computers. And so they began to lay the groundwork that ultimately led to AlphaFold2.
They realized that data sharing would be critical to the future. So in 1971, U.S. and British scientists formed the Protein Data Bank to serve as a digital library of every protein structure determined and to be determined.
They realized that data quality would be critical. So they implemented statistical data checks and standards to ensure the integrity and quality of the data in the digital protein structure library.
They realized that data access would be critical. So they mandated that the data in the library should be openly available to the world.
They wanted to democratize and facilitate the collection of new data. So they convinced governments, including Canada's, to invest in and maintain billion-dollar data-collection facilities that provide free access to scientists.
They also realized that transparent mechanisms to compare computer methods would be critical. So in 1994, John Moult in the United States started to organize benchmarking competitions to allow “structure predictors” to hone their programs.
Finally, they realized that the computational methods would require experimental data from a wide diversity of proteins. So governments supported projects, including ours in Canada, to determine the shapes of thousands of different “unusual” proteins — expressly for the purpose of enabling predictive methods.
And now, in 2020, 70 years later, on the shoulders of tens of thousands of scientists and billions of dollars of investment, it has all come together with AlphaFold2.
This is a huge validation of the potential of AI in biology. But for me, it’s as much a win for the scientific process, and should equally be a celebration of a community’s foresight, a community’s commitment to data quality and data sharing, and the 70 years of support from the taxpayers of the world. We can all take credit for the win.
There is also a lesson for the future. The next big problems, such as designing new drugs against proteins to prevent the next pandemic, will be solved only by generating big data and going big on data quality and data sharing. And we must start now if we are to be ready for the pandemics to come.
The best time to plant an oak tree was 20 years ago; the second-best time is now.