Predicting cancer risk on the basis of national health data

Large-scale study uses data from Danish health registries to predict individual risks of developing 20 different types of cancer

Credit: Karen Arnott/EMBL-EBI


  • Early detection of cancer leads to better clinical outcomes, but current screening programmes are limited.

  • Researchers used data from Danish health registries to develop a statistical model that can predict individuals’ risk of developing 20 different types of cancer, based on their family history, disease history, and lifestyle.

  • This work shows that national health registries can be used to identify individuals who are at high risk of developing different cancer types. In the long term, such statistical models could help guide cancer screening programmes to enable earlier detection.

Scientists from EMBL’s European Bioinformatics Institute (EMBL-EBI) and the German Cancer Research Center (DKFZ) have used Danish health registries to predict individual risks for the 20 most common types of cancer. This statistical study is a proof of concept, but the analysis suggests the model could be adapted and transferred to other healthcare systems. It could help to identify people with a high risk of developing cancer, for whom early cancer screening programmes could be trialled.

Detecting cancer early gives patients more treatment options and generally results in better clinical outcomes. Current screening programmes focus only on specific cancer types, for example, bowel or cervical cancer, although new blood tests are being trialled that could detect multiple cancer types. If there were a simple way to use health data to calculate an individual’s risk of developing cancer, this could further inform cancer screening.

Harnessing information-rich data

This large, systematic study used comprehensive data from the Danish health registries, in which all clinical diagnoses of the population are stored. The researchers systematically analysed health records, family history, and lifestyle data. While the analysis does not allow an exact prediction of which person will develop cancer, it does determine the individual risk and enables a comparison with people of a similar age. The paper was published in Lancet Digital Health.

The prediction model was first trained on data collected between 1995 and 2014 from 6.7 million adults. The training dataset included more than 90 million diagnoses spanning over 1,000 different diseases. 

The model was then validated on datasets from the same registry, collected between 2015 and 2018, and covering 4.7 million Danes. The agreement between the model’s predictions and the time, if any, at which individuals developed cancer was 81%. The model had high accuracy for cancers of the digestive system, as well as for thyroid, kidney, and uterine cancer.

To test if this model would work with health data from other countries, the researchers leveraged data from the UK Biobank and achieved comparable levels of accuracy. 

“Such models are not perfect, but they can provide valuable information for new risk-adapted cancer screening programmes,” said Moritz Gerstung, Division Head at DKFZ and Visiting Research Group Leader at EMBL-EBI. “But the only way this works is by having a system for capturing and leveraging comprehensive health data. The Danish health data ecosystem is unique because it holds digital data for the entire population and spans decades. Only a few European countries offer something similar, including Finland, Iceland and Sweden, or special research cohorts in the UK.”

Factors that influence cancer risk

The work confirmed well-known factors that are associated with cancer, such as smoking and alcohol consumption. The researchers also found that while family history is most informative before the age of 45, at an older age, an individual’s disease history is more informative for their cancer risk. 

The work also suggests that individuals who previously suffered from multiple different diseases can be at a higher risk of developing cancer. “A typical pattern is that, sadly, diseases, including cancer, often come in clusters. This doesn’t mean that preceding diseases cause cancer, and the true reason might be a different one. But it’s a correlation that could be taken into account when calculating cancer risk,” said Gerstung. 

“The novelty of the study lies in the volumes and richness of data we used, and the work we did to scale up well-established statistical models,” said Alexander Jung, Postdoctoral Researcher at the University of Copenhagen and visiting scientist at EMBL-EBI. “Our model covers more than 1,000 factors that can contribute to a person’s risk of developing cancer, and this is huge compared to previous models which only took a few factors into consideration.”


This work was supported by the Novo Nordisk Foundation and the Danish Innovation Foundation.

Source article(s)

Jung A. W., et al.,

Lancet Digital Health 22 May 2024


