October 30th, 2014 3:12 pm ET - Muin J Khoury, Director, Office of Public Health Genomics, Centers for Disease Control and Prevention
The term Big Data is used to describe massive volumes of both structured and unstructured data that are so large and complex that they are difficult to process and analyze. Examples of big data include the following: diagnostic medical imaging, DNA sequencing and other molecular technologies, environmental exposures, behavioral factors, financial transactions, geographic information, and social media information. It turns out that Big Data is all around us! As Leroy Hood once commented, “We predict that in 5 to 10 years each person will be surrounded by a virtual cloud of billions of data points” (see figure 1). Genome sequencing of humans and other organisms has been a leading contributor to Big Data, but other types of data are becoming larger, more diverse, and more complex, exceeding the abilities of currently used approaches to store, manage, share, analyze, and interpret them effectively. We have all heard claims that Big Data will revolutionize everything, including health and healthcare.
Consider the following Internet headlines:
Some scientists even claim that “the scientific method itself is becoming obsolete” as giant computers and data analytic software sift through the digital world to provide predictive models for health and disease based on the information available. Are these promises too good to be true?
Today, there are several promising applications of Big Data in improving health involving use of genome sequencing technologies. For example:
Using whole genome sequencing to improve public health detection of and response to outbreaks of infectious diseases
Here is a highly pertinent example for public health. In the mid-19th century, a cholera outbreak swept through London. John Snow, now considered by many to be the father of modern epidemiology, mapped his investigation on paper, indicating homes where cholera had struck. After long and laborious work, he implicated the Broad Street pump (figure 2) as the source of the outbreak, even before the cause of cholera (the Vibrio cholerae bacterium) was known. “Today, Snow might have crunched GPS information and disease prevalence data and solved the problem within hours.”
I believe we should not be carried away by the hype and overpromise of Big Data. In fact, Big Data today is often more noise than signal! Sorting through all of the data to determine what is a real signal and what is noise does not always work as expected. For example, in 2013, when influenza hit the US hard and early, Google attempted to monitor the outbreak by analyzing flu-related Internet searches and drastically overestimated peak flu levels compared with traditional public health surveillance efforts. Even more problematic is the potential for many false alarms when data are examined mindlessly on a large scale, producing putative associations between Big Data points and disease outcomes. This process may falsely infer causality and could potentially lead to ineffective or harmful interventions. Examples of funny but spurious correlations are displayed daily on this fun website to show how analysis of Big Data using online information systems can lead to absurd correlations, such as “honey producing bee colonies inversely correlates with juvenile arrests from marijuana”. The field of genomics has recognized this problem and addressed it by requiring replication of study findings and much stronger statistical signals before an association is accepted. To analyze big data appropriately, the field of genomics also relies on epidemiologic studies, animal models, and other work in addition to big data analysis. Big Data’s strength is in finding associations, but its weakness is in not showing whether these associations have meaning. Finding a signal is only the first step.
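The false-alarm problem described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not an analysis of any real dataset: the function names (`pearson_r`, `screen_random_predictors`), the sample size of 20 people, the 2,000 predictors, and both cutoffs are hypothetical choices made for illustration. Every predictor is pure random noise, so any "association" with the outcome is spurious by construction.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def screen_random_predictors(n_people=20, n_predictors=2000, seed=0):
    """Correlate many pure-noise predictors with a pure-noise outcome."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    correlations = []
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_people)]
        correlations.append(pearson_r(predictor, outcome))
    return correlations


# Assumption: with 20 observations, |r| > 0.444 corresponds roughly to a
# two-sided p < 0.05 for a single test.
NOMINAL_R = 0.444
# Assumption: an arbitrary, much stricter cutoff chosen only for illustration.
STRICT_R = 0.80

correlations = screen_random_predictors()
nominal_hits = sum(abs(r) > NOMINAL_R for r in correlations)
strict_hits = sum(abs(r) > STRICT_R for r in correlations)

print(f"Predictors screened: {len(correlations)}")
print(f"'Significant' at the nominal cutoff: {nominal_hits}")
print(f"Surviving the stricter cutoff: {strict_hits}")
```

Because about 5% of noise predictors are expected to pass the nominal cutoff by chance alone, screening thousands of variables produces many apparent signals even when none is real. Demanding a much stronger signal, and then replicating it in an independent sample, is the kind of safeguard the genomics field adopted.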
Even John Snow himself needed to start with a plausible epidemiological hypothesis to know where to look, i.e., to identify the dataset he needed to examine. If all he had had to go on was a source of Big Data he might well have ended up with a correlation as spurious as the honey bee-marijuana connection. Crucially, Snow also “did the experiment”. He removed the handle from the pump and dramatically reduced the spread of cholera in the population, thus convincingly moving from correlation to causation.
Separating noise from signal won’t be quick or easy, as was obvious from the presentations and the discussions I participated in at the 2014 American Society of Human Genetics special symposium on Big Data. From a public health perspective, I offer 4 recommendations to realize the potential of Big Data in the age of genomics to improve health and prevent disease:
1. Epidemiologic Foundation: We need a strong epidemiologic foundation for studying Big Data in health and disease. The associations found using Big Data need to be studied and replicated in ways that confirm the findings and make them generalizable. By that, I mean the study of well-characterized and representative populations such as the NCI-sponsored cohort consortium that has been collecting information on more than 4 million people over multiple decades. Big Data analysis is currently based on convenience samples of people or data available on the Internet. Both sources may be fraught with all sorts of problems, such as selection bias, confounding, and lack of generalizability. For more than a decade, we have promoted an epidemiologic approach to the human genome, and now it is time to extend this approach to all Big Data.
2. Knowledge Integration: We need to develop a robust “knowledge integration” (KI) enterprise to make sense of Big Data. In a recent article titled “Knowledge Integration at the Center of Genomic Medicine” we elaborated on the definition of KI and its three components: knowledge management, knowledge synthesis, and knowledge translation in genomics. A similar evidence-based knowledge integration process applies to all Big Data beyond genomics. We hope that the recently launched NIH Big Data to Knowledge (BD2K) awards will support the development of new approaches, software, tools, and training programs to improve access, analysis, synthesis, and interpretation of genomic Big Data and improve our ability to make and validate new discoveries.
3. Evidence-based Medicine: We should embrace (and not run away from) principles of evidence-based medicine and population screening. In a previous blog we elaborated on the relationship between genomic medicine and evidence-based medicine. I believe the same relationship applies to all Big Data. Big Data is literally a hypothesis-generating machine that could lead to interesting, robust, and predictive associations with health outcomes. However, even after these associations are established, evidence of utility (i.e., improved health outcomes and no evidence of harms) is still needed. Documenting health-related utility of genomics and Big Data information may necessitate the use of randomized clinical trials and other experimental designs.
4. Translational Research: As with genomic medicine, we need a robust translational research agenda for Big Data that goes beyond the initial discovery (the bench to bedside model). In genomics, most published research consists of either basic scientific discoveries or preclinical research designed to develop health-related tests and interventions. What happens after that is really the research “road less traveled”. In fact, less than 1% of published research deals with validation, implementation, policy, communication, and outcomes in the real world. Reaping the benefits of using Big Data for genomics research will require an expanded translational research agenda that goes beyond the initial discoveries.
We are interested in our readers’ thoughts on Big Data and examples of success stories in medicine and public health. We are also interested in your feedback on scientific approaches needed to reap the benefits of Big Data in improving health.