Personal Genome Project: Public Comments on NIH draft Genomic Data Sharing Policy
November 21, 2013
The Personal Genome Project (PGP) is a global network of research studies with thousands of participants dedicated to the creation of public resources composed of genome and phenotype data. The first PGP research study was founded at Harvard Medical School in 2005, and international sites now exist in three additional countries.1 The PGP has been at the forefront of participatory research in genome sequencing and has extensive experience with the ethical, privacy, and consent issues involved. We welcome this opportunity to publicly comment on the NIH draft Genomic Data Sharing (GDS) Policy and make recommendations for improvements.
Our recommendations can be summarized as two areas for improvement in section IV.C. of the draft policy: (1) to adequately inform researchers and participants of the inherent identifiability of genetic data, and (2) to require that researchers share with participants their personal research data, both to establish reciprocity and to increase data sharing.
The inherent identifiability of genetic data
The draft GDS Policy makes no mention of the inherent identifiability of genetic data. All genetic and phenotype data shared is mandated to be “de-identified”. Footnote eight of the draft states: “‘De-identified’ refers to removing information that could be used to associate a dataset or record with a human individual. Under this Policy, data should be de-identified according to the standards set forth in the HHS Regulations for the Protection of Human Subjects and the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.”
This definition of “de-identified” is inconsistent: genetic data is inherently identifiable. Using nothing more than genetic data and other publicly available data, researchers were able to identify nearly 50 individuals whose samples were “de-identified” (i.e. all public data met the same standards mandated by this draft).2 It is now a documented fact that this type of genetic data, even if scrubbed of personal information as described in this draft, “could be used to associate a dataset or record with a human individual”. Genetic data itself violates the draft’s definition of “de-identified”.
In the past, de-identification of samples or data sets by stripping personal data (name, social security number, date of birth, etc.) was sufficient to avoid re-identification of a particular subject. Genetic data was not seen as an equivalently identifiable piece of information. This has been demonstrated to no longer be the case, and the identifiability of genetic data is likely to increase and may eventually become trivial. Ancestry databases currently link genetic elements to surnames and in the future are likely to link genetic elements to individual ancestors. Controlled-access databases create a legal barrier to re-identification, but data security breaches are possible and have been an increasingly high-profile issue in recent years. If the NIH is to mandate that all participants in NIH-funded studies producing large-scale genetic data agree to broad sharing of their genetic and phenotypic data, it is mandating the exposure of many participants to a known re-identification risk.
If the NIH wishes to uphold the public trust in biomedical research, it must respect the right of research participants to be informed of relevant risks. If all potential participants in these studies are asked to agree that their “genomic and phenotypic data may be shared broadly for future research use” (link), they must also be adequately informed regarding the identifiability of that data.
We recommend this draft be amended to:
- Add language that acknowledges the inherent identifiability of human genetic data.
- Add to section IV.C.4 instructions for researchers to inform participants regarding the potential identifiability of the genomic data they are sharing (despite planned de-identification procedures) and, in the case of controlled-access data sets, the potential for data security breaches.
Sharing research data with participants
The draft GDS Policy mandates that all NIH-funded research studies that wish to produce “large-scale”3 human genetic data require all participants from whom samples are collected to consent that their “genomic and phenotypic data may be shared broadly for future research use” (link). Elsewhere, the draft defines such sharing as sharing through NIH-designated controlled-access or open-access databases (the latter only if participants “have provided explicit consent for sharing their data through open-access mechanisms”).
The draft does not address sharing genomic data with the participants themselves. We strongly recommend that the NIH consider requiring such sharing, for two reasons.
The first reason is to establish reciprocity in the data sharing mandate. This draft mandates that all participants in NIH-funded studies generating large-scale genetic data allow unknown individuals broad access to their genomic and phenotypic data – without ever having access to that data themselves. Participants’ genetic data is sensitive, meaningful, and identifiable. Participants deserve the reciprocal mandate that their personal data being shared with others also be shared with them.
The second reason is that this is a significant opportunity to further the NIH’s data sharing goals. Participant-managed data sharing is a promising mechanism for open-access data sharing. Even if participants would not have agreed to open-access at the outset of a study, their attitudes may change. Additionally, participants may wish to share their data with future studies in a selective manner. Participant access to data enables an additional participant-managed model for data sharing, and we can imagine a future where numerous studies benefit from participant-donated data.
We recommend the following:
- For participants consented after the effective date of this policy, add a requirement for researchers to give these participants access to their personal data that is shared with other researchers.
- Because some researchers may be unable to comply with this requirement, also allow researchers to instead provide specific reasons why this data sharing cannot be performed. Some mechanism should also be provided for participants to access these reasons in a study-specific manner (such as in a public database).
1) In section IV.C.4: “If there are compelling scientific reasons that necessitate the use of cell lines or clinical specimens that were created or collected after the effective date of this Policy and that lack consent for research use and data sharing, investigators should provide a justification for the use of any such materials in the funding request.” We suggest clarification of whether the lack of informed consent automatically exempts the researcher from data sharing, or if data sharing is expected to occur despite the exemption.
2) We suggest clarification confirming that “sample identification” using genomic data or other genotypic assays which are not intended to identify individual human participants is acceptable (e.g. detection of duplicate samples across different studies for statistical validity or for quality assurance).
3) “Binary alignment matrix (BAM)” should probably be “Binary Alignment/Map (BAM)”. Assuming this is a reference to SAM and BAM files, there is no clear definition of what the BAM acronym abbreviates (“B” could potentially mean “BGZF” or “Binary”), but a SAM file is defined here as a “Sequence Alignment/Map”: http://samtools.sourceforge.net/SAMv1.pdf
Many thanks to the Harvard PGP staff who contributed to these recommendations: Madeleine Ball, Jason Bobe, Michael Chou, George Church, Tom Clegg, Preston Estep, Jeantine Lunshof, and Alexander Wait Zaranek.
1 Three PGP sites currently exist outside the United States: (1) PGP-Canada, based out of the McLaughlin Centre, University of Toronto & Sick Kids Hospital; (2) PGP-UK, based out of University College London; and (3) another site in the EU with ethics approval, set to launch in early 2014. The PGP Global network is coordinated by PersonalGenomes.org, a 501(c)(3) nonprofit based in Boston, Massachusetts. To learn more please visit: http://www.personalgenomes.org/mission
2 Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. “Identifying personal genomes by surname inference.” Science. 2013 Jan 18;339(6117):321-4.
3 Defined as more than 100 participants for genotyping or multi-gene sequence data, or whole genome sequence from a single participant.