Data Science and Machine Learning in Public Health: Promises and Challenges

September 20, 2019 by Chirag J Patel and Danielle Rasooly, Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, and Muin J. Khoury, Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, Georgia

a man holding two circles - one with a bar chart and one with a tablet with data coming out

In August 2019, two of us (CJP, DR) visited the Centers for Disease Control and Prevention and gave a seminar on the promises and challenges of using “big data” for “precision public health” using the tools of “data science”. The seminar was well attended, with more than 200 participants. The audience was engaged, asking great questions to try to unpack how relevant these new technologies and analytic methods are to public health. Here is a quick summary of what transpired and the road ahead.

First, what do all these terms mean? “Big data” refers to large amounts of information, such as data from biobanks (e.g., UK Biobank) and administrative health claims, that are becoming available to researchers in a de-identified fashion. What makes them really “big” is the sheer number of individuals represented – on the order of hundreds of thousands to millions – and/or the massive amount of information recorded about each person. This includes information such as their postal code and, in some cases, their genomes. Given that the primary use for these datasets is often not research but other purposes such as billing, the natural question is, “are these data helpful for health-related discoveries and public health surveillance?”

In our seminar, we showed that one way to tackle big data is to use the approaches of machine learning and data science, which describe how we process big data (e.g., with tidyverse), learn patterns in the data, and ultimately validate those patterns to make sure they make sense before they are deployed to doctors, patients, or policy makers. We can think of machine learning as computationally demanding methods that analyze complex relationships between variables, for example, finding links between large numbers of clinical or environmental factors and risk for disease.

One example that demonstrates the potential of machine learning to improve the accuracy of disease diagnosis comes from medical image analysis, such as automating screening for diabetic retinopathy. Diabetes may affect 100 million people globally, but manual analysis of image data is currently a bottleneck that slows down screening and, ultimately, preventive care and treatment. Work from researchers at Google and its collaborators from around the world shows how a new branch of machine learning – deep learning – can automate image analysis at an accuracy level equivalent to the very best physician examiners.

Big data research has been enabled by the availability of the computing power and image data needed to execute complex machine learning algorithms. However, understanding why an algorithm makes a particular decision is still difficult and could pose a major roadblock for adoption.

Another example is the integration of data types to better understand complex associations between genetics, environment, and disease. The Harvard group has been using large administrative datasets to untangle the relationship between genetics and environment in all diseases recorded in health insurance claims data.

Scientists around the world have also been using biobanks to discover new genetic variants through genome-wide association studies, and to run environment-wide association studies and family-history-wide association studies that identify novel exposures associated with disease risk. Such exposures might have been missed (or reported as false positives) when studied one at a time.
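
To make the point about false positives concrete, here is a minimal sketch in Python of an exposure-wide scan on simulated data. Everything in it is an assumption made for illustration and is not taken from the studies mentioned above: the function names (`simulate_scan`, `benjamini_hochberg`), the sample sizes, the effect sizes, and the choice of the Benjamini-Hochberg procedure as the way to control the false discovery rate. Real association studies use more careful models and adjustments.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value_from_r(r, n):
    """Two-sided p-value for r using the Fisher z normal approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of tests kept at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    return sorted(order[:cutoff])


def simulate_scan(n_people=2000, n_exposures=200, n_true=5, seed=0):
    """Simulate exposures where only the first n_true affect the outcome.

    Hypothetical data: returns one p-value per exposure.
    """
    rng = random.Random(seed)
    exposures = [[rng.gauss(0, 1) for _ in range(n_people)]
                 for _ in range(n_exposures)]
    outcome = [
        sum(0.3 * exposures[j][i] for j in range(n_true)) + rng.gauss(0, 1)
        for i in range(n_people)
    ]
    return [p_value_from_r(pearson_r(e, outcome), n_people) for e in exposures]


if __name__ == "__main__":
    alpha = 0.05
    p_values = simulate_scan()
    naive = [i for i, p in enumerate(p_values) if p <= alpha]
    fdr = benjamini_hochberg(p_values, alpha)
    print("exposures passing p <= 0.05 with no correction:", len(naive))
    print("exposures kept after FDR correction:", len(fdr))
```

In this simulation only the first five exposures truly influence the outcome, so any other exposure that passes the uncorrected threshold is a false positive. Testing all exposures together and correcting for the number of tests is one way to keep such chance findings in check.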

Deploying machine learning comes with many challenges, such as limited generalizability, confounding, and complex correlations between variables. Many of these challenges are not unique to machine learning. New modeling approaches (see example here) have the additional caveat of not being easily explainable to clinicians or policy makers. Analysts are also faced with choices about which variables to model; these choices are often arbitrary and can lead to different findings or interpretations.
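
The effect of modeling choices can be sketched with a small simulation in Python. This is an illustration under stated assumptions, not an analysis from the seminar: the variable names (`age`, `income`, `exposure`), the coefficients, and the helper functions `solve`, `ols`, and `exposure_estimates` are all hypothetical. The simulated exposure has a true effect of 0.2 on the outcome, and the sketch estimates that effect under every possible set of adjustment variables.

```python
import itertools
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the top
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design]
           for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    return solve(xtx, xty)


def exposure_estimates(n=5000, seed=1):
    """Estimate the exposure effect under each choice of adjustment set."""
    rng = random.Random(seed)
    covariates = {
        "age": [rng.gauss(0, 1) for _ in range(n)],
        "income": [rng.gauss(0, 1) for _ in range(n)],
    }
    # Both covariates influence the exposure and the outcome (confounding)
    exposure = [0.5 * a + 0.5 * i + rng.gauss(0, 1)
                for a, i in zip(covariates["age"], covariates["income"])]
    outcome = [0.2 * x + 0.5 * a - 0.3 * i + rng.gauss(0, 1)
               for x, a, i in zip(exposure, covariates["age"],
                                  covariates["income"])]
    estimates = {}
    for k in range(len(covariates) + 1):
        for subset in itertools.combinations(covariates, k):
            cols = [exposure] + [covariates[name] for name in subset]
            estimates[subset] = ols(cols, outcome)[1]  # index 1 = exposure
    return estimates


if __name__ == "__main__":
    for subset, beta in exposure_estimates().items():
        label = ", ".join(subset) if subset else "nothing"
        print(f"adjusted for {label}: estimated effect = {beta:.2f}")
```

Each adjustment set gives a different estimate of the same underlying effect, and only the model that adjusts for both simulated confounders is expected to land near the true value of 0.2. Reporting results across several reasonable model specifications, rather than a single one, is one way to show how much a finding depends on these choices.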

Regardless of the promises and challenges of big data and machine learning, we can all be better data scientists by learning about this field and how to use machine learning. The good news is that there are plentiful and accessible materials to accelerate “human learning” about “machine learning.” To get started, check out the course on ‘Data Science for Medical Decision Making’.
