I’m working on a new project and part of it is made possible by an observation that we
stumbled on with mosdepth. It’s something that’s obvious in retrospect
but wasn’t fully apparent to me until after mosdepth
was mostly written. In short, that observation is: computers can do stuff with arrays quickly.
The longer story behind that obvious and simple observation is as follows. mosdepth is a tool to calculate depth from BAM/CRAM files. Rather than streaming out the coverage (keeping a heap of reads with a cursor in each read to indicate the current position, popping reads off the heap as the cursor moves past the end of the read, adding reads onto the heap as the genomic position advances, and … [snip]), mosdepth instead
allocates an array the size of the current chromosome, then considers each read one at a time. For each read, it increments the value at the read's start position in the array by 1, decrements the value at the end position by 1, discards that read, and proceeds to the next. Aaron used this same algorithm in bedtools genome coverage, which uses 2 arrays instead of 1. mosdepth does a bit more work for reads with deletions and reads that overlap their mate, but that's it. Once all reads from a chromosome are consumed and added, the per-base coverage is the cumulative sum of the elements in the array.
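A minimal sketch of that scheme in Python/numpy might look like the following (the read coordinates and chromosome length are invented for illustration; mosdepth itself is written in Nim and also handles CIGAR events and mate overlaps):

```python
import numpy as np

# hypothetical inputs, just for illustration
chrom_len = 1_000                      # length of the chromosome
reads = [(3, 10), (5, 12), (8, 20)]    # (start, end) of each aligned read

# one extra slot so decrementing at `end` never goes out of bounds
arr = np.zeros(chrom_len + 1, dtype=np.int32)

for start, end in reads:
    arr[start] += 1   # coverage begins here
    arr[end] -= 1     # coverage ends here (end-exclusive)

# per-base depth is just the running (cumulative) sum
depth = np.cumsum(arr[:-1])
print(depth[:22])
```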
Given this array for human chromosome 1, which is over 249 million bases (or 249 million int32’s), on my laptop it takes about:
- 0.125 seconds to calculate this cumsum to convert the array values from start/end incr/decrs to per-base coverage
- 0.204 seconds to calculate the mean coverage for each of 111626 exons for all known transcripts on chromosome 1.
- 0.756 seconds to calculate quantized coverage (contiguous bins of coverage at specified depths) for
-q 0:1:2:3:4:5:6:7:8:12:16:20:25:30:40:
- 0.508 seconds to calculate the cumulative coverage distribution (proportion of bases covered at 1X, 2X, … NX)
… and so on. So yeah, computers are good at arrays and stuff.
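For a sense of why these operations are so cheap once the depth array exists, here is a rough numpy sketch of two of them: mean coverage per region via a prefix sum, and the cumulative coverage distribution via a bincount. The depth array and region coordinates are simulated for the example; this is not mosdepth's actual implementation:

```python
import numpy as np

# simulate a per-base depth array (a stand-in for the cumsum'd array above)
rng = np.random.default_rng(0)
depth = rng.poisson(30, size=1_000_000).astype(np.int32)

# mean coverage per region (e.g. exons): with a prefix sum, each region is O(1)
prefix = np.concatenate(([0], np.cumsum(depth, dtype=np.int64)))
regions = [(100, 250), (5_000, 5_800), (999_000, 999_500)]  # invented (start, end)
for start, end in regions:
    mean_cov = (prefix[end] - prefix[start]) / (end - start)
    print(f"{start}-{end}: mean depth {mean_cov:.2f}")

# cumulative coverage distribution: proportion of bases covered at >= xX
counts = np.bincount(depth)                       # bases at each exact depth
ge = counts[::-1].cumsum()[::-1] / depth.size     # suffix sum -> fraction >= i
for x in (1, 10, 20, 30, 40):
    print(f">= {x}X: {ge[x]:.3f}")
```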
We can add a lot of utility to mosdepth, all of it computed in a single run, and it doesn’t affect the runtime appreciably since (decompressing and) parsing the BAM dominates the run-time. Another benefit is that once everything is in the array, the times of all subsequent operations are independent of the depth. I didn’t realize all of these niceties until after it was mostly implemented.
So, this is convenient. mosdepth doesn’t have much in the way of new ideas; it just takes full advantage. One thing I realized later is that depth is a special case of a more generic thing we might want to measure. For depth, we increment/decrement at the start/end of each read (or the read’s component segments), but one could increment/decrement at the start/end of any event: the per-base locations of soft-clips/hard-clips/insertions/deletions, etc.
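To make that generalization concrete, here is a small sketch (event positions invented; this is not code from mosdepth or bigly): point events like soft-clips just increment a counter array directly, while interval events reuse the same increment/decrement-plus-cumsum trick as depth:

```python
import numpy as np

chrom_len = 1_000  # hypothetical chromosome length

# point events, e.g. per-base positions of soft-clips (invented data)
softclip_sites = [17, 17, 42, 99, 42]
clips = np.zeros(chrom_len, dtype=np.int32)
for pos in softclip_sites:
    clips[pos] += 1  # no cumsum needed for point events

# interval events, e.g. deletions spanning [start, end) (invented data)
deletions = [(200, 230), (210, 250)]
arr = np.zeros(chrom_len + 1, dtype=np.int32)
for start, end in deletions:
    arr[start] += 1   # a deletion begins covering this base
    arr[end] -= 1     # and stops here (end-exclusive)
deletion_depth = np.cumsum(arr[:-1])  # per-base count of overlapping deletions
```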
I did some of this in bigly, which is based on the heap structure that we avoided with mosdepth, but in the next post I’ll talk about what I’ll be working on and how it’s useful to do this in the mosdepth way.
p.s. thanks to Tom Sasani (AKA sixsani) for the logo and to Aaron for brainstorming sessions on what became the core of mosdepth.