Measurements of intrahost viral diversity are extremely sensitive to systematic errors in variant calling.

Abstract

With next generation sequencing technologies, it is now feasible to efficiently sequence patient-derived virus populations at a depth of coverage sufficient to detect rare variants. However, each sequencing platform has characteristic error profiles, and sample collection, target amplification, and library preparation are additional processes whereby errors are introduced and propagated. Many studies account for these errors by using ad hoc quality thresholds and/or previously published statistical algorithms. Despite common usage, the majority of these approaches have not been validated under conditions that characterize many studies of intrahost diversity. Here we use defined populations of influenza virus to mimic the diversity and titer typically found in patient-derived samples. We identified single nucleotide variants using two commonly used variant callers, DeepSNV and LoFreq. We found that the accuracy of these variant callers was lower than expected and exquisitely sensitive to input titer. Small reductions in specificity had a significant impact on the number of minority variants identified and subsequent measures of diversity. We were able to increase the specificity of DeepSNV to >99.95% by applying an empirically validated set of quality thresholds. When applied to a set of influenza samples from a household-based cohort study, these changes resulted in a 10-fold reduction in measurements of viral diversity. We have made our sequence data and analysis code available so that others may improve on our work and use our dataset to benchmark their own bioinformatic pipelines. Our work demonstrates that inadequate quality control and validation can lead to significant overestimation of intrahost diversity.

IMPORTANCE:

Advances in sequencing technology have made it feasible to sequence patient-derived viral samples at a level sufficient for detection of rare mutations. These high-throughput, cost-effective methods are revolutionizing the study of within-host viral diversity. However, these techniques are error prone, and the methods commonly used to control for these errors have not been validated under the conditions that characterize patient-derived samples. Here we show that these conditions affect measurements of viral diversity. We found that the accuracy of previously benchmarked analysis pipelines was greatly reduced under patient-derived conditions. By carefully validating our sequencing analysis using known control samples, we were able to identify biases in our method and improve our accuracy to acceptable levels. Application of our modified pipeline to a set of influenza samples from a cohort study provides a realistic picture of intrahost diversity and suggests the need for rigorous quality control in such studies.
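
The claim that small reductions in specificity strongly affect the number of minority variants identified can be illustrated with simple arithmetic. The sketch below is not the authors' analysis code. It assumes an influenza A genome of roughly 13,500 nucleotides, three possible non-reference bases per site, and that nearly all candidate variants are true negatives; the function name and constants are hypothetical and chosen only for illustration.

```python
def expected_false_positives(genome_length, specificity, alt_alleles_per_site=3):
    """Expected number of false-positive variant calls in one sample.

    Assumes every site offers `alt_alleles_per_site` candidate variants and
    that essentially all of them are true negatives (illustrative assumption).
    """
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    candidate_variants = genome_length * alt_alleles_per_site
    # The false-positive rate is 1 - specificity, applied to every candidate.
    return candidate_variants * (1.0 - specificity)


# Assumed approximate genome length for influenza A (not taken from the paper).
GENOME_LENGTH = 13_500

for spec in (0.9999, 0.9995, 0.999, 0.99):
    fp = expected_false_positives(GENOME_LENGTH, spec)
    print(f"specificity {spec:.2%}: ~{fp:.0f} expected false positives per sample")
```

Under these assumptions, a specificity of 99.95% still leaves roughly 20 expected false calls per sample, and 99% leaves roughly 400. If a sample carries only a handful of true minority variants, false calls at either level could dominate the apparent diversity, which is consistent with the sensitivity to specificity described in the abstract.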