Future Continuous - Comments on the Genomic Data Sharing Policy (GDS Policy) Draft
The NIH has issued a new proposal for sharing genomic data.
In general, this is a very good development. I had a few suggestions on the proposal and, following the NIH's request for comments, I sent them the letter below:
Comments on the Genomic Data Sharing Policy (GDS Policy) Draft
1. Open Data Sharing Plans
The new GDS proposal states: “Investigators and their institutions are expected to address plans for following this Policy in the data sharing section of funding applications and proposals…[T]he NIH expects the informed consent process and documents to state that a participant’s genomic and phenotypic data may be shared broadly for future research purposes and also explain whether the data will be shared through open or controlled access.”
My suggestion is that the data-sharing plan sections of awards be made publicly accessible in RePORTER (or an equivalent system) once the award is granted. Such transparency has several advantages. First, it will allow broader oversight of awardees' adherence to their data sharing plans. Second, it will show the level of commitment of the NIH and its grant recipients to data sharing. Third, making the data sharing plans available will enable empirical ELSI research on data sharing trends. Fourth, data sharing plans do not contain any sensitive or proprietary information. To the best of my knowledge, based on the NIH FOIA Office guidelines, data sharing plans are already accessible via FOIA requests. Therefore, there is no particular reason not to post them publicly in advance.
2. Data Submission Expectations are not Clear
In section C1, Data Submission Expectations and Timeline, the new GDS proposal says: “Human data that are submitted to NIH-designated data repositories should be de-identified according to … the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule”
Specifically, when it mentions the HIPAA Privacy Rule, the GDS proposal refers to 45 CFR 164.514(b)(2), the HIPAA Safe Harbor section.
The current wording is confusing. The HIPAA Safe Harbor sets out two requirements for de-identifying data. The first requires the removal of 18 different identifiers, including “Biometric identifiers” (Identifier #16) and “Any other unique identifying number, characteristic, or code” (Identifier #18). The second requires that “[t]he covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.”
Multiple lines of research from our lab and other groups have shown that genomic and other types of omics data carry sufficient information to identify individuals in various scenarios (Lin et al., Science, 2004; Homer et al., PLoS Genetics, 2008; Schadt et al., Nature Genetics, 2012; Gymrek et al., Science, 2013. For a detailed review, see our preprint in arXiv 1310.3197). Omics datasets inherently contain the biometric identifiers and unique characteristics that are described by the HIPAA Safe Harbor. In addition, there is actual knowledge that this information could be used to identify individuals. Therefore, it is impossible to de-identify omics datasets according to the HIPAA Safe Harbor without destroying the actual data. In other words, the GDS proposal asks investigators to submit empty datasets.
I am sure that this was not the intention, but the HIPAA Safe Harbor tactics are not an adequate mechanism. As an alternative, I suggest that you specify a subset of identifiers from the HIPAA Safe Harbor that should not be included.
Another issue is that the GDS proposal states that “Human data that are submitted to NIH-designated data repositories should be de-identified”. While still a minority, certain study participants are not interested in keeping their data private and might decide to allow sharing with explicit identifiers to facilitate future research (Goolsby et al., IOM Roundtable, 2010). For example, genetic studies of facial dysmorphologies sometimes ask study participants for permission to share their photos. In other cases, such as the PGP-10, the participants decide to publicly identify themselves. It is impossible to de-identify these datasets. The current GDS proposal seems to prevent the sharing of these studies in NIH-designated repositories.
My suggestion is to revise this sentence to something similar to “the NIH-designated repositories accept studies with and without explicit identifiers (such as name, photos, or contact information) of participants. The final decision whether to release these explicit identifiers should be addressed in the informed consent”.
3. Including unmapped reads in BAM files
Appendix A of the GDS proposal sets five levels of data based on the amount of processing. Level 2 is the rawest format that is expected to be commonly shared for human studies: “[It] would be a file (e.g., binary alignment matrix (BAM) files) usually containing the unmapped reads as well.”
It is highly advisable to include the unmapped reads as a standard in these BAM files to maximize consistency, serendipity, and secondary data usage.
Alignment programs vary considerably in their performance, their algorithms for dealing with repetitive elements, and their gapped alignment methods (Treangen and Salzberg, Nature Reviews Genetics, 2011). These variations can translate into distinct error patterns in downstream algorithms. Fusing data from multiple studies that used different alignment techniques might be error prone and introduce systematic biases that can create false positives. Including the unaligned reads by default will make it possible to recover the original FASTQ files and neutralize most of the systematic biases.
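To illustrate why a BAM file that keeps its unmapped reads is enough to rebuild the original reads, here is a minimal Python sketch that turns SAM-formatted text records (the text form of BAM) back into FASTQ. It is an illustration under stated assumptions, not a replacement for established tools: it assumes single-end reads, no hard clipping, and one primary record per read, and the function names sam_line_to_fastq and sam_to_fastq are hypothetical. The flag bits it uses (0x10 for reverse strand, 0x100 and 0x800 for secondary and supplementary records) come from the SAM format specification.

```python
# Minimal sketch: rebuild FASTQ records from SAM-formatted alignment lines.
# Assumptions: single-end reads, no hard clipping, one primary record per read.
# Function names are hypothetical and used only for illustration.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

FLAG_REVERSE = 0x10         # SEQ is stored reverse-complemented
FLAG_SECONDARY = 0x100      # secondary alignment of a read already reported
FLAG_SUPPLEMENTARY = 0x800  # supplementary (e.g. chimeric) alignment


def sam_line_to_fastq(line):
    """Return a 4-line FASTQ record for one SAM line, or None to skip it."""
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return None  # keep only the primary record of each read
    if flag & FLAG_REVERSE:
        # Restore the orientation in which the read came off the sequencer.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


def sam_to_fastq(lines):
    """Convert an iterable of SAM lines (header lines allowed) to FASTQ text."""
    records = []
    for line in lines:
        if line.startswith("@"):  # header line, not a read
            continue
        record = sam_line_to_fastq(line)
        if record is not None:
            records.append(record)
    return "".join(records)


example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",   # mapped, forward strand
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tAACG\tABCD",  # mapped, reverse strand
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGN\t!!!!",          # unmapped
]
fastq_text = sam_to_fastq(example)
```

If the unmapped record r3 had been dropped from the file, no amount of post-processing could bring it back, which is the practical reason for making unmapped reads a default.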
In addition, a wide range of software, from our group and others, implements specialized algorithms to call “exotic” types of variations and biological events in DNA and RNA data (Hormozdiari et al., Bioinformatics, 2010; Gymrek et al., Genome Research, 2012; Highnam et al., Nucleic Acids Research, 2012; Dobin et al., Bioinformatics, 2013). These algorithms usually require access to the unmapped reads, as they employ specialized alignment strategies. Including the unaligned reads by default will enable this type of research.
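As a small illustration of what access to the unmapped reads means in practice, the following Python sketch pulls the unmapped records out of SAM-formatted text by testing flag bit 0x4, which the SAM format specification defines as "segment unmapped". It does not reproduce the method of any of the tools cited above; the function name unmapped_reads is hypothetical, and the sketch only shows where such a tool would find its input when the reads are retained.

```python
# Minimal sketch: select the unmapped reads from SAM-formatted lines.
# The function name is hypothetical and used only for illustration.

FLAG_UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for every record flagged as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):  # header line, not a read
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & FLAG_UNMAPPED:
            yield fields[0], fields[9], fields[10]


records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # mapped
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGN\t!!!!",         # unmapped
]
candidates = list(unmapped_reads(records))
```

A specialized caller could then realign these candidate reads with its own strategy, something that is not possible when the submitting study has discarded them.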