Around here there’s a strong belief that more coverage is better. If a variant is covered by 1000 reads it has to be real, right? After all, what are the chances that each read (or half of them, at least) has an error at the same reference position? Some of our standard pipelines even go so far as to discard all variant calls with depth less than 50 reads. I think most folks, when pressed, will admit that some errors are due to misaligned reads, and that these types of errors are likely depth-independent. But that’s the exception, right?
I was interested in checking this assumption out for myself. Interestingly, for a while a comprehensive assessment wasn’t possible since no one had a large data set of known variants, and without a ‘gold standard’ where the true variants are known it’s impossible to compute how things like depth are related to accuracy. Fortunately, a few months ago Justin Zook and a few folks at NIST put together a really nice consensus set of variant calls from the oft-sequenced sample NA12878 (I hear the paper is in review). The calls are made from seven exome and two whole-genome sequences, from a variety of variant callers and platforms, and currently it’s the closest thing we have to a gold-standard set of calls.
With a trusted set of variant calls in hand, it’s not hard to calculate how depth is related to accuracy. We’ve sequenced the NA12878 exome a few times now, and it’s a simple matter to find out which calls are true, which are false, and what the depth of each call is. To let the cat out of the bag, the following graph sums it up pretty well:
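The bookkeeping behind that graph is simple enough to sketch. Here's a minimal, hypothetical version: it assumes each call is a `(chrom, pos, alt, depth)` tuple and the gold standard is a set of `(chrom, pos, alt)` keys (the real comparison would work on VCF records and be fussier about representation), and it splits the calls into true and false positives so their depth distributions can be compared.

```python
# Sketch only: tally read depth for true vs. false calls against a
# gold-standard set. The data layout here is hypothetical, not the
# actual pipeline's.

def split_by_truth(calls, gold_standard):
    """Partition calls into (true_depths, false_depths) by whether the
    call appears in the gold-standard set."""
    true_depths, false_depths = [], []
    for chrom, pos, alt, depth in calls:
        if (chrom, pos, alt) in gold_standard:
            true_depths.append(depth)
        else:
            false_depths.append(depth)
    return true_depths, false_depths

# Toy data to show the shape of the comparison:
gold = {("chr1", 100, "A"), ("chr1", 250, "T")}
calls = [("chr1", 100, "A", 40),   # in the gold standard -> true
         ("chr1", 250, "T", 200),  # in the gold standard -> true
         ("chr1", 300, "G", 900)]  # not in it -> false positive

tp_depths, fp_depths = split_by_truth(calls, gold)
print(sum(tp_depths) / len(tp_depths))  # mean depth of true calls: 120.0
print(sum(fp_depths) / len(fp_depths))  # mean depth of false calls: 900.0
```

From there it's just a matter of plotting the two depth distributions side by side.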
For this data set, false positive calls have, on average, somewhat higher coverage than true calls, especially for SNPs. So much for the common wisdom. I haven’t done the math yet, but as far as filtering goes it looks like we’d be better off discarding variants with depth greater than 125 than discarding those below any particular threshold.
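Half-seriously, that inverted filter would look like this. The 125-read ceiling is the eyeballed value from the graph, not a validated cutoff, and the `(chrom, pos, alt, depth)` record layout is hypothetical:

```python
# Sketch of the tongue-in-cheek alternative: drop calls ABOVE a depth
# ceiling instead of below a floor. MAX_DEPTH is eyeballed, not tuned.

MAX_DEPTH = 125

def depth_ceiling_filter(calls, max_depth=MAX_DEPTH):
    """Keep only calls whose depth (4th field) is at or below max_depth."""
    return [c for c in calls if c[3] <= max_depth]

calls = [("chr1", 100, "A", 40), ("chr1", 300, "G", 900)]
print(depth_ceiling_filter(calls))  # keeps only the depth-40 call
```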
Along similar lines, I was also interested in the relationship between “quality”, as produced by the GATK’s UnifiedGenotyper, and variant call accuracy. The quality score is supposedly reflective of the caller’s confidence in the variant. For instance, a quality score of 30 indicates that a variant has a 0.001 chance of being false, while a score of 100 indicates that the variant has a 10^-10 chance of being false. In a standard exome with about 50,000 variants it should be very unlikely that any variant with a quality score > 100 is false. Is this really the case?
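For reference, those numbers come straight from the phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10):

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a phred-scaled quality Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality implied by an error probability p."""
    return -10 * math.log10(p)

print(phred_to_error_prob(30))   # 0.001
print(phred_to_error_prob(100))  # 1e-10
# Expected false calls above Q100 in a ~50,000-variant exome,
# if the scores were calibrated: vanishingly few.
print(50000 * phred_to_error_prob(100))
```

So if the scores were well calibrated, an exome's worth of Q100+ calls should contain essentially zero false positives.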
As the above figure shows, there’s at least some relationship here – true variants do, on average, have somewhat higher quality scores than false variants. In particular, below phred Q20 or so (around 3 on the scale above) most variants are indeed false. But don’t bet the farm on it. Sadly, above Q150 (near 5 on the scale above) there are still plenty of false positive calls.
Long story short, neither depth nor “quality” is a good predictor of whether or not a variant is a false positive. I expect that some more complicated combination of statistics, including things like strand bias, homopolymer run counts, etc., may do a better job. More on that later.