The “Serratus” cloud-computing infrastructure enables researchers to effectively search public sequence databases for biological viruses. So far, more than 130,000 new RNA viruses have been identified – from coronaviruses to relatives of the hepatitis D virus and bacteriophages. The international team behind the project, which also includes researchers from the Heidelberg Institute for Theoretical Studies, reports the findings in the journal “Nature.”
The diversity of viruses on our planet is largely unexplored: so far, only a small fraction of existing viruses is known to science. The current SARS-CoV-2 pandemic has shown how devastating the consequences of emerging viral diseases can be for humanity. It is therefore critical to catalog the global diversity of viruses with the aid of methods from computer science and to make it accessible to researchers.
Random finds in the rainforest
Public sequence databases have become a vast repository of genetic data, with contributions from researchers around the world. These data come from biological research groups that generate sequence data, whether to study the soil microbiome of the Amazon rainforest or to track the spread of pathogens such as the SARS-CoV-2 virus. Typically, such studies obtain genetic sequence data not only from the intended target organism, but also from other organisms whose genetic material happens to be included in the sample. Because these incidental data are not the focus of the original study, they are usually ignored, yet they may be of particular interest to other researchers. They are nevertheless deposited in the public databases.
An infrastructure for efficient searching
Unearthing this hidden treasure requires researchers to search through immensely large and distributed data sets, because the freely accessible public databases contain sequence data on the order of petabytes (one petabyte is one million gigabytes). Researchers in the international Serratus project have developed Serratus for this purpose, an open-source cloud computing infrastructure that can perform petabyte-scale sequence alignment.
“Our infrastructure enables efficient searching of the Sequence Read Archive, one of the most popular public sequence repositories,” explains co-author Pierre Barbera from the Computational Molecular Evolution group at the Heidelberg Institute for Theoretical Studies (HITS). He developed software to calculate and analyze the phylogenetic trees of all the species studied. Researchers at the Max Planck Institute for Biology in Tübingen are also involved in the project. They contributed their biocomputing software “DIAMOND” to the project, which, like a search engine, lists matches of protein building blocks of sequenced organisms in just a few hours. Until recently, such calculations required months even with high-performance computers and the previous gold standard BLAST. The enhanced version “DIAMOND v2” is being developed in collaboration with the Max Planck Computing and Data Facility in Garching.
Also involved in the project are scientists from the Institut Pasteur (Paris, France), the University of St. Petersburg (Russia), the University of Valencia (Spain), the University of British Columbia (Canada) and UC Berkeley (USA). The corresponding author of the study is bioinformatician Artem Babaian (now at the University of Cambridge, UK).
Number of newly discovered viruses increased tenfold
Using the tools developed, the researchers were able to identify more than 130,000 new RNA viruses, a tenfold increase in the number of known virus species. These included previously unknown members of the coronavirus family related to the SARS-CoV-2 virus, novel viruses related to the hepatitis D virus, and novel bacteriophages, viruses that specifically target bacteria.
The results of their study have now been published in the journal Nature. The data from the project are open source and can also be found on the website www.serratus.io, so that researchers can access and study them further.
Edgar, R.C., Taylor, J., Lin, V. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature, 26 January 2022. DOI: 10.1038/s41586-021-04332-2 / https://www.nature.com/articles/s41586-021-04332-2