A few weeks ago Brad Chapman posted an interesting column on his blog, Blue Collar Bioinformatics, about the impact of “rebinning” base quality scores in .sam or .fastq formatted files. Base quality scores are typically Phred-scaled integer values from 0 to 40, each encoded as a single character, that express the likelihood that a base call is correct. Most modern variant callers and even some aligners use this information to do their job, and tools even exist to improve the accuracy of base quality scores. “Rebinning” the quality scores simply means reducing the number of possible values that a quality score may take. Of course some information is lost here – we might simply say that all bases with qualities in the 10-20 range now have a quality of 15. So is there an upside?
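For those unfamiliar with the encoding, a Phred score Q corresponds to an error probability of 10^(-Q/10), and rebinning just collapses each score to a representative value for its bin. Here is a minimal Python sketch using width-10 bins with midpoints as representatives, as in the example above (the bin edges and representative values are my own assumption, not a standard scheme):

```python
def phred_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def rebin_quality(q, bin_width=10):
    """Collapse a Phred score to the midpoint of its bin.

    With bin_width=10, scores 10-19 all become 15, 20-29 become 25, etc.
    (Assumed binning scheme for illustration only.)
    """
    return (q // bin_width) * bin_width + bin_width // 2

print(rebin_quality(13))      # 15
print(phred_error_prob(20))   # 0.01
```

Real rebinning tools may pick bin boundaries and representative values more carefully, for example to preserve the average error probability within each bin.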
Nearly all of us use some form of compression to reduce the size of fastq or sam files. The compression algorithms work by identifying repeated sequences of bytes and, instead of storing each repeated sequence individually, adding an indicator that sequence X exists at several positions. This works great for bases: since there are only four possible nucleotides there are a lot of repeated sequences, and the compression algorithms really reduce the size of the files. Base qualities are a different story. With 40 (or more) possible values there are relatively few repeated sequences and little compression can occur. This is where rebinning comes in: by reducing the number of possible values for the base qualities, compression algorithms become much more efficient. How much more? Here’s a quick breakdown of a sample file:
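The effect is easy to demonstrate with synthetic data. The sketch below compresses two quality strings of the same length, one drawn from 40 possible characters and one from only 4. (Real quality strings are far from uniformly random, so the gains on actual .fastq files will differ, but the direction of the effect is the same.)

```python
import random
import zlib

random.seed(42)
n = 100_000

# Quality strings as Phred+33 ASCII characters: one with 40 possible
# scores, one "rebinned" down to 4 representative scores.
full_range = bytes(random.randrange(40) + 33 for _ in range(n))
rebinned = bytes(random.choice((5, 15, 25, 35)) + 33 for _ in range(n))

print(len(zlib.compress(full_range, 9)))  # larger
print(len(zlib.compress(rebinned, 9)))    # noticeably smaller
```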
| BQ bins | Size (Gb) | Decrease % |
|---------|-----------|------------|
Turns out the decrease is pretty significant. By moving from 40 to only 4 bins, for example, file size can be reduced by over 40%. Nothing comes without a price, however, and an important concern is the impact on the variant calls produced from these files. To measure how reducing the number of base quality bins affects call quality, I systematically reduced the number of bins in a .bam file from NA12878, then ran the GATK’s UnifiedGenotyper to identify possible variants. To assess the quality of the call sets, I compared the results to the NIST’s “Genome in a Bottle” reference calls. (The sample I was working with was an exome, so I only examined sites that were possible to call in our sample.)
| BQ bins | Fraction of true variants found | False variants |
|---------|---------------------------------|----------------|
| 4 bins – no recal | 0.787 | 12658 |
| 2 bins – no recal | 0.785 | 12769 |
The verdict? Decreasing the number of bins from 40 actually increased specificity, while sensitivity was similar to or somewhat worse than with 40 bins. Even down to only 2 quality scores, results were similar to, or perhaps somewhat better than, 40 bins. Whether or not these differences are “real” and not statistical noise remains to be seen, but even in the latter case reducing the number of bins yields results essentially indistinguishable from 40 bins. A final interesting result comes from the “no recal” rows, in which I called variants without using the GATK’s base quality score recalibration (BQSR). In these cases the variant calls actually did get worse – definitely a few more false positives, although sensitivity was similar. To be honest, I found this surprising. Until recently I had suspected that BQSR was of little benefit, but I stand corrected here. So, consistent with Brad Chapman’s results, this seems to indicate that base quality rebinning does not reduce the quality of variant calls. Go forth and rebin, my friends, and enjoy smaller .BAM files! Just don’t skip the BQSR.
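For anyone wanting to reproduce the comparison step, its core is just set arithmetic on variant positions. Here is a hypothetical sketch (the positions are made up, and a real comparison against the Genome in a Bottle calls also has to handle genotype matching and variant representation differences):

```python
# Hypothetical truth set and call set, keyed by (chromosome, position).
truth = {("chr1", 100), ("chr1", 250), ("chr2", 75)}
calls = {("chr1", 100), ("chr2", 75), ("chr2", 900)}

sensitivity = len(calls & truth) / len(truth)  # fraction of true variants found
false_positives = len(calls - truth)           # calls not in the truth set

print(round(sensitivity, 3), false_positives)  # 0.667 1
```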