Base quality score rebinning

A few weeks ago Brad Chapman posted an interesting article on his blog, Blue Collar Bioinformatics, about the impact of “rebinning” base quality scores in .sam or .fastq formatted files. Most often, base quality scores are Phred-scaled integers from 0 to 40, each encoded as a single character, that express the likelihood that a base call is correct. Most modern variant callers and even some aligners use this information to do their job, and tools even exist to improve the accuracy of base quality scores. “Rebinning” the quality scores simply means reducing the number of distinct values that a quality score may take. Of course some information is lost here: we might, for example, say that all bases with qualities in the 10-20 range now have a quality of 15. So is there an upside?
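To make the idea concrete, here is a minimal Python sketch of rebinning a single quality string. It assumes the common Phred+33 character encoding, and the bin edges and representative values in `FOUR_BINS` are made up for illustration (only the 10-20 → 15 bin comes from the example above); this is not the scheme used by any particular tool.

```python
def rebin_quality(qual, bins, offset=33):
    """Replace each quality character with the representative value of its bin.

    qual   -- quality string, assumed to be Phred+33 encoded
    bins   -- list of (low, high, representative) tuples, inclusive on both ends
    """
    # Build a score -> representative lookup once.
    lookup = {}
    for low, high, rep in bins:
        for q in range(low, high + 1):
            lookup[q] = rep

    out = []
    for ch in qual:
        q = ord(ch) - offset
        # Scores that fall outside every bin are left unchanged.
        out.append(chr(lookup.get(q, q) + offset))
    return "".join(out)


# Hypothetical 4-bin scheme; the edges are illustrative only.
FOUR_BINS = [(0, 9, 5), (10, 20, 15), (21, 30, 25), (31, 93, 35)]

# Qualities 10, 20, 30 and 40 become 15, 15, 25 and 35.
print(rebin_quality("+5?I", FOUR_BINS))
```

The first two characters, which encode qualities 10 and 20, end up identical after rebinning, and that loss of distinction is exactly the information being given up.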

Nearly all of us use some form of compression to reduce the size of fastq or sam files. Compression algorithms work by identifying repeated sequences of bytes and, instead of storing each repeated sequence individually, recording that sequence X occurs at several positions. This works great for bases: since there are only four possible nucleotides, there are a lot of repeated sequences, and the compression algorithms really reduce the size of the files. Base qualities are a different story. With 40 (or more) possible values there are relatively few repeated sequences and little compression can occur. This is where rebinning comes in: by reducing the number of possible values for the base qualities, compression algorithms become much more efficient. How much more? Here’s a quick breakdown of a sample file:

BQ bins   Size (Gb)   Decrease (%)
40 bins   6.8          0.0
8 bins    4.3         36.7
4 bins    3.9         42.6
2 bins    3.6         47.0
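The effect is easy to reproduce on a toy scale. The sketch below compresses a synthetic quality string before and after rebinning, using Python's `zlib` (DEFLATE, the algorithm behind gzip) as a stand-in for whatever compression your files use. The qualities are drawn uniformly at random, which real base qualities are not, and the 4-bin scheme is hypothetical, so the sizes it prints say nothing about the numbers in the table; only the direction of the effect carries over.

```python
import random
import zlib


def make_rebin_table(bins, offset=33):
    """Build a str.translate table mapping each Phred+33 character to its bin's representative."""
    table = {}
    for low, high, rep in bins:
        for q in range(low, high + 1):
            table[q + offset] = rep + offset
    return table


random.seed(0)  # fixed seed so repeated runs give the same string

# Synthetic data: 100,000 qualities drawn uniformly from 0..40 (an assumption).
quals = "".join(chr(random.randint(0, 40) + 33) for _ in range(100_000))

# Hypothetical 4-bin scheme covering 0..40.
table = make_rebin_table([(0, 9, 5), (10, 20, 15), (21, 30, 25), (31, 40, 35)])
rebinned = quals.translate(table)

size_full = len(zlib.compress(quals.encode("ascii")))
size_rebinned = len(zlib.compress(rebinned.encode("ascii")))

print("41 values:", size_full, "bytes compressed")
print(" 4 values:", size_rebinned, "bytes compressed")
```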

Turns out the decrease is pretty significant. By moving from 40 to only 4 bins, for example, file size can be reduced by over 40%. Nothing comes without a price, however, and an important concern is the impact on the variant calls produced from these files. To get a handle on how reducing the number of base quality bins affects variant call quality, I systematically reduced the number of bins in a .bam file from NA12878, then ran the GATK’s UnifiedGenotyper to identify possible variants. To assess the quality of the call sets, I compared the results to the NIST’s “Genome in a Bottle” reference calls. (The sample I was working with was an exome, so I only examined sites that were possible to call in our sample.)

BQ bins              True variants found (%)   False variants
40 bins              79.4                      12155
8 bins               80.0                      11615
4 bins               78.7                      11605
4 bins – no recal    78.7                      12658
2 bins               79.5                      11953
2 bins – no recal    78.5                      12769
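For readers who want to run a similar comparison, the two columns above boil down to simple set arithmetic. The sketch below is not the script used for this post; it assumes variants are identified by a (chromosome, position, ref, alt) tuple, that both sets have already been restricted to the callable regions, and the function name `evaluate_calls` and the toy variants are invented for illustration.

```python
def evaluate_calls(called, truth):
    """Return (fraction of truth variants found, number of calls absent from truth)."""
    called, truth = set(called), set(truth)
    true_positives = called & truth   # calls that match a reference variant
    false_positives = called - truth  # calls with no match in the reference set
    sensitivity = len(true_positives) / len(truth) if truth else 0.0
    return sensitivity, len(false_positives)


# Toy data, made up for illustration: (chrom, pos, ref, alt).
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 7, "A", "T"),  # not in the truth set, so counted as a false variant
}

sensitivity, n_false = evaluate_calls(called, truth)
print(f"true variants found: {100 * sensitivity:.1f}%, false variants: {n_false}")
```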

The verdict? Decreasing the number of bins from 40 actually increased specificity (fewer false variants), while sensitivity was similar to, or at worst slightly below, that of 40 bins. Whether these differences are “real” or just statistical noise remains to be seen, but even if they are noise, reducing the number of bins yields results essentially indistinguishable from 40 bins. Even with only 2 quality scores, results seem similar to, or maybe somewhat better than, 40 bins. A final interesting result comes from the “no recal” rows, in which I called variants without using the GATK’s base quality score recalibration (BQSR). In these cases the variant calls actually did get worse: definitely a few more false positives, although sensitivity was similar. To be honest, I found this surprising. Until recently I’ve always suspected that BQSR is of little benefit, but I stand corrected here. So, similar to Brad Chapman’s results, this seems to indicate that base quality rebinning does not reduce the quality of variant calls. So go forth and rebin, my friends, and enjoy smaller .bam files! Just don’t skip the BQSR.

 

 

 

One response to “Base quality score rebinning”

  1. Brendan;
    Thanks for this useful post, and congrats on the SNPSVM publication. I’ve also been looking at the influence of BQSR on variant calls and came to similar conclusions. I also attempted to put a bound on what you miss if you skip quality recalibration. The full blog post is here:

    http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/

    Thanks again for this. It’s great to have other folks working on similar questions,
    Brad
