You don't need to pileup

I stumbled on this (now) obvious way of doing things that I hadn’t seen used much/at all; In a project soon to be released, we needed to quickly assay thousands of sites from BAM/CRAM files and do a sort of cheap genotyping–or allele counting. Given this task, a common tool to reach for is the pile-up.

Pile-up is pretty fast but it has to do a lot of work. Even to assay a single site, a pileup will first get each read and a pileup structure (bam_pileup1_t in htslib) for each read into memory. Each pileup structure keeps a sort of cursor in the read. This is a flexible approach and pretty fast when assaying multiple consecutive sites. However, especially when assaying sites more than a few bases apart, it does a lot of extra work so it’s probably faster to not use pileup.

A faster way is to take each read:

iterate over the cigar until the requested genomic position is reached.
calculate the offest into the alignment (query) sequence for that position
check if the query base matches the reference.
discard the read.

The last item is to explicitly note that the reads are not kept in memory. This can be done in any language that allows access to the cigar. And can be much faster in interfaces like pysam whose pileup API, when given a query of even 1 base will create an iterator that generates a pileup for all positions that have an alignment that also overlaps the query location. So if a 10KB nanopore read overlaps that position of interest, it will generate pileups for all 10K sites in that read as well as the single site of interest.

Here is an example using hts-nim which makes it especially palatable (IMNSHO). It queries a particular chrom and position and checks against the expected ref_allele


  for aln in bam.query(chrom, position, position + 1):
    var 
      off = aln.start
      qoff = 0
      roff_only = 0
      nalt = 0
      nref = 0

    for event in aln.cigar:
      var cons = event.consumes
      if cons.query:
        qoff += event.len
      if cons.reference:
        off += event.len
        if not cons.query:
          roff_only += event.len

      # continue until we get to the genomic position
      if off <= position: continue

      # since each cigar op can consume many bases
      # calc how far past the requested position
      var over = off - position - roff_only
      # get the base 
      var base = aln.base_at(qoff - over)
      if base == ref_allele:
        nref += 1
      else:
        nalt += 1

    # now nalt and nref are the allele counts ready for use.

the final result has nalt and nref which can be used for genotyping. The logic could probably be simplified a bit, but I wrote that in very short order and can now re-use. This can easily be extended to get the base-qualities at each position (hts-nim has aln.base_quality_at(qpos) to complement the aln.base_at seen here) and to check for indels as well as SNPs.

I haven’t seen this used elsewhere, perhaps due to the convenience of the pileup API or for lack of looking? I may try to convert this to have a sort of API (or just a function) that allows flexible querying of this sort that hides the cigar accounting.