hts-nim is a library that allows one to use htslib via the nim programming language.
Nim is a garbage-collected language that compiles to C and often has similar performance.
I have become very productive in nim
and especially in hts-nim
and there are by now, at least a few other users of hts-nim
.
This post is to show how one particular feature of nim enables users to write their own functions that will be used no differently than
hts-nim
’s provided functionality.
Currently, in order to access the format (sample) fields in a VCF, one must first allocate a seq
(something like a typed python list) that gets filled.
The example below shows a code stub to extract the DP
(depth) and GQ
(genotype quality) fields for each sample into those pre-allocated seq
s.
var dps = newSeq[int32]()
var gqs = newSeq[int32]()
for variant in vcf:
doAssert variant.format.get("DP", dps) == Status.OK
doAssert variant.format.get("GQ", gqs) == Status.OK
# do stuff with dps and gqs ...
This lets hts-nim
fill the dps
and gqs
sequences without allocating memory to provides maximal
performance. It’s not much of a burden, but it might be nice to have a function that just returns the values and raises
an error if the field is not found so that pre-allocation is not necesary.
Since nim has universal function call syntax (UFCS), it’s easy to write some sugar that can be used to get values without pre-allocating if that’s what’s required. Here’s how I’d write that (though see note at end of post).
proc ints*(f:FORMAT, field: string): seq[int32] =
var vals = newSeq[int32]()
if variant.format.get(field, vals) != Status.OK:
raise newException(KeyError, "error getting field:" & field & ". " & $Status.OK
return vals
This will throw an exception if the requested field does not exist. Note that the *
after ints
means
the function is exported for use outside the currrent module, which is what we want here.
This function could be in a file named sugar.nim
and then could be used as:
import hts/vcf
import ./sugar
for variant in vcf:
var dps = variant.format.ints("DP") # note these could raise exceptions.
var gqs = variant.format.ints("GQ")
# do stuff with dps and gqs ...
This shows the UFCS
as ints(variant.format, "DP")
is equivalent to variant.format.ints("DP")
.
This means that users are not limited to the syntax and functionality that hts-nim
provides, they can
write their own functions and use them as ergonomically as those that are part of hts-nim. Since hts-nim
errs on the side of efficiency rather than ergonomics, functions like these could make some nice syntactic
sugar for using hts-nim
as a sort of scripting language for processing genomic data.
side note.
The above sugar function ints
could be written more succinctly using nims
resultvariable along with the
fact that
seqs are never
nil` as:
proc ints*(f:FORMAT, field: string): seq[int32] =
if variant.format.get(field, result) != Status.OK:
raise newException(KeyError, "error getting field:" & field & ". " & $Status.OK
where result will be initalized by nim and set to the proper size (and filled) by hts-nim.