This post is highlight the speed benefit of CRAM files over BAM files as it seems to not be widely used.
CRAM files are often about 50% of the size of an identical BAM for lossless compression largely due to not saving the sequence of each read, instead keeping only the delta to the reference sequence for the alignment. Additional savings can be gained from lossy compression of base-qualities and read-names.
Using Nim to count sequence-motifs in a BAM
This is the 2nd post describing mundane uses of nim in day-to-day genomics.
The first post is here.
For today’s mundane task, a colleague asked me to count the occurence of 2 k-mers of length 39 in each of 603 BAMs of 60X coverage. We don’t care about partial matches or allowing mismatches so this is a pretty simple task. There are more efficient computational methods for this, but I wrote the simplest version, verified that it ran in < 2 hours for a single bam and then ran it overnight on a single machine with 64 CPUs and sent the results in the morning.
Nim Hts Example
Several folks have recently expressed interest in learning nim which I have found to be very useful for genomics because it has a simple syntax like python, but it compiles to be as fast as C. In the case of mosdepth which is written in nim, it is faster than the competing C implementations because of choice of algorithm.
I have started using nim in my day-to-day scripting to replace python, in part, so this will be the first in a series of posts that show some relatively mundane code that I write like a script but will compile to run very fast.
You don't need to pileup
I stumbled on this (now) obvious way of doing things that I hadn’t seen used much/at all; In a project soon to be released, we needed to quickly assay thousands of sites from BAM/CRAM files and do a sort of cheap genotyping–or allele counting. Given this task, a common tool to reach for is the pile-up.
Pile-up is pretty fast but it has to do a lot of work. Even to assay a single site, a pileup will first get each read and a pileup structure (bam_pileup1_t in htslib) for each read into memory.