Here we will benchmark InterLap
TLDR InterLap is:
Create 50K random intervals of random lengths between 2KB and 20KB across 100MB of chromosome:
In [1]: import random, sys
In [2]: from interlap import InterLap
In [3]: random.seed(42)
In [4]: sites = random.sample(range(22, 100000000, 12), 50000)
In [5]: sites = [(s, s + random.randint(2000, 20000)) for s in sites]
In [6]: %timeit InterLap(sites)
1 loops, best of 3: 174 ms per loop
In [7]: inter = InterLap(sites)
In [8]: len(inter)
Out[8]: 50000
Now we have the searchable object inter with 50K intervals. Since InterLap is a sorted list, time to create is simply time to create a sorted list.
We can time to see how fast we can test every site in the tree
In [9]: %timeit assert all(r in inter for r in sites)
1 loops, best of 3: 980 ms per loop
And make sure we dont see intervals at positions < 0:
In [10]: %timeit assert not any((-r[0], -r[1]) in inter for r in sites)
1 loops, best of 3: 840 ms per loop
Now do 100K (10K * 10) queries.
In [11]: def times(inter, test_intervals, n_times):
....: for i in xrange(n_times):
....: a = sum(t in inter for t in test_intervals)
....:
In [12]: test_sites = random.sample(range(21, 100000000, 12), 10000)
In [13]: test_sites = [(t, t + random.randint(2000, 200000)) for t in test_sites]
In [14]: %timeit times(inter, test_sites, 10)
1 loops, best of 3: 2.06 s per loop
We can compare to the naive method (only 1 iteration since it will be slow):
In [15]: def naivetimes(query_intervals, test_intervals):
....: for t in test_intervals:
....: isin = any(q[1] > t[0] and q[0] < t[1] for q in query_intervals)
....:
In [16]: %timeit naivetimes(sites, test_sites)
1 loops, best of 3: 8.8 s per loop
So InterLap is over 40 times (remember we did only 1 rep here instead of 10) as fast as the naive method with no memory overhead.
We can compare this to bx-python which uses a tree structure written in cython.
In [17]: from bx.intervals import IntervalTree
In [18]: %timeit IntervalTree(sites)
1000000 loops, best of 3: 324 ns per loop
In [19]: tree = IntervalTree(sites)
In [20]: def bxtimes(tree, test_intervals, n_times):
....: for i in xrange(n_times):
....: a = sum(len(tree.find(ts[0], ts[1])) > 0 for ts in test_intervals)
....:
In [21]: %timeit bxtimes(tree, test_sites, 10)
10 loops, best of 3: 108 ms per loop
As expected, bx-python is much faster, but InterLap does perform quite well at ~50K queries per second.
InterLap will come within 2X of bx-python if most of the query intervals are found in the database.