Poking Holes in Genetic Privacy
By GINA KOLATA
Published: June 16, 2013
Not so long ago, people who provided DNA in the course of research studies were told that their privacy was assured. Their DNA sequences were on publicly available Web sites, yes, but they did not include names or other obvious identifiers. These were research databases, scientists said, not like the forensic DNA banks being gathered by the F.B.I. and police departments.
But geneticists nationwide have gotten a few rude awakenings, hints that research subjects in fact could sometimes be identified by their DNA alone, or even by the way their cells were using their DNA. The latest shock came in January, when a researcher at the Whitehead Institute, which is affiliated with the Massachusetts Institute of Technology, managed to track down five people selected at random from a database using only their DNA, ages and the states in which they lived. And he did it in just hours. He also found relatives — a total of close to 50 people.
This month an international group of nearly 80 researchers, patient advocates, universities and organizations like the National Institutes of Health announced that it wants to consolidate the world’s databases of DNA and other genetic information, making data easier for researchers to retrieve and share. But the security and privacy of the study subjects are paramount concerns, said Dr. David Altshuler of the Broad Institute of Harvard and M.I.T., a leader of the group.
“The problems are not yet solved in any general way,” Dr. Altshuler said. “We want to work to solve them.”
For years now, a steady stream of research has eroded scientists’ faith that DNA can be held anonymously.
The first shock came in 2008, when David W. Craig, a geneticist at TGen, a research institute in Phoenix, and his colleagues imagined a theoretical problem. Suppose you are trying to learn what percentage of intravenous drug users are infected with hepatitis, and you collect DNA from discarded needles and amass it in a database to look for signs of the virus in the genetic material. Is there any way, they wondered, to find out whether a particular person’s DNA is in this soup of genes?
Most researchers would have said the task was impossible, worse than looking for a needle in a haystack. But Dr. Craig and his colleagues found a way to do it, exploiting the four million or so tiny, and usually inconsequential, differences in DNA letters between one individual and another. With their method, using the combinations of hundreds of thousands of DNA markers, the researchers could find a person even if his or her DNA constituted just 0.1 percent of the total in the mix.
So explosive was the finding that Dr. Craig deliberately chose to write about it only very technically. The N.I.H. understood what he had accomplished, though, and quickly responded, moving all genetic data from the studies it financed behind Internet firewalls to prevent the public or anyone not authorized from using the data and, it was hoped, to protect the identities of research subjects.
But another sort of genetic data — so-called RNA expression profiles that show patterns of gene activity — were still public. Such data could not be used to identify people, or so it was thought.
Then Eric E. Schadt of Mount Sinai School of Medicine discovered that RNA expression data could be used not only to identify someone but also to learn a great deal about that person. “We can create a profile that reflects your weight, whether you are diabetic, how old you are,” Dr. Schadt said. He and a colleague were also able to tell whether a person was infected with viruses, like HPV or H.I.V., that change the activity of genes. Moreover, they were able to make what they called a genetic bar code that could be used to identify a person in a number of DNA databases.
Then, this year, in perhaps the most disturbing exercise, Yaniv Erlich, a genetics researcher at the Whitehead Institute, used a new computational tool he had invented to identify by name five people from DNA sequences that he had randomly selected from a research database containing the genes of one thousand people.
Experts were startled by what Dr. Erlich had done. “We are in what I call an awareness moment,” said Eric D. Green, director of the National Human Genome Research Institute at the National Institutes of Health.
Research subjects who share their DNA may risk a loss of not just their own privacy but also that of their children and grandchildren, who will inherit many of the same genes, said Mark B. Gerstein, a Yale professor who studies large genetic databases.
Even fragments of genetic information can compromise privacy. James Watson, a discoverer of DNA’s double helix shape, had his genes sequenced and made the information public — except for one, the sequence for ApoE, a gene that has a variant linked to an increased risk of Alzheimer’s disease. Researchers noticed, though, that they could still figure out if Dr. Watson had that variant by examining the DNA on either side of the gene he had removed. They did not reveal whether he had it.
With so many questions about the privacy and security of genetic data, researchers wonder what research subjects should be told. Leaks and identification of study subjects will never be completely avoidable, said George Church, a Harvard geneticist. And as much as investigators might like to find a way to keep genetic data secure and private, he does not think there is an exclusively technical solution.
“If you believe you can just encrypt terabytes of data or anonymize them, there will always be people who hack through that,” Dr. Church said.
He believes that people who provide genetic information should be informed that a loss of privacy is likely, rather than unlikely, and agree to provide DNA with that understanding.
Other researchers say the idea is not far-fetched, and some suggest that scientists be licensed before they are given access to genetic databases, with severe penalties for those who breach privacy.
“My fear is not so much that someone will take everyone’s genomes and put them on the Web,” Dr. Gerstein said. “It is that a graduate student in some lab somewhere will naïvely post bits of genomes on his Facebook page. The idea is that before he could get access to genomes, he would be taught he can’t do that. And if he did he would lose his license.”
The amount of genetic data that has been gathered so far is minuscule compared with what will be coming in the next few years, Dr. Altshuler noted, making it important to address the problems before the data deluge makes them worse.
“We see substantial issues,” he said. “We want to have serious discussions now.”