Benchmarking

Here we will benchmark InterLap

TLDR InterLap is:

  • 40 times faster than naive method with no memory overhead
  • 20 times slower than the bx-python interval tree written in cython

Create 50K random intervals of random lengths between 2KB and 20KB across 100MB of chromosome:

In [1]: import random, sys

In [2]: from interlap import InterLap

In [3]: random.seed(42)

In [4]: sites = random.sample(range(22, 100000000, 12), 50000)

In [5]: sites = [(s, s + random.randint(2000, 20000)) for s in sites]

In [6]: %timeit InterLap(sites)
1 loops, best of 3: 174 ms per loop

In [7]: inter = InterLap(sites)

In [8]: len(inter)
Out[8]: 50000

Now we have the searchable object inter with 50K intervals. Since InterLap is a sorted list, time to create is simply time to create a sorted list.

We can time to see how fast we can test every site in the tree

In [9]: %timeit assert all(r in inter for r in sites)
1 loops, best of 3: 980 ms per loop

And make sure we dont see intervals at positions < 0:

In [10]: %timeit assert not any((-r[0], -r[1]) in inter for r in sites)
1 loops, best of 3: 840 ms per loop

Now do 100K (10K * 10) queries.

In [11]: def times(inter, test_intervals, n_times):
   ....:      for i in xrange(n_times):
   ....:          a = sum(t in inter for t in test_intervals)
   ....: 

In [12]: test_sites = random.sample(range(21, 100000000, 12), 10000)

In [13]: test_sites = [(t, t + random.randint(2000, 200000)) for t in test_sites]

In [14]: %timeit times(inter, test_sites, 10)
1 loops, best of 3: 2.06 s per loop

Compared to Naive Method

We can compare to the naive method (only 1 iteration since it will be slow):

In [15]: def naivetimes(query_intervals, test_intervals):
   ....:      for t in test_intervals:
   ....:          isin = any(q[1] > t[0] and q[0] < t[1] for q in query_intervals)
   ....: 

In [16]: %timeit naivetimes(sites, test_sites)
1 loops, best of 3: 8.8 s per loop

So InterLap is over 40 times (remember we did only 1 rep here instead of 10) as fast as the naive method with no memory overhead.

Compared to bx-python interval-tree

We can compare this to bx-python which uses a tree structure written in cython.

In [17]: from bx.intervals import IntervalTree

In [18]: %timeit IntervalTree(sites)
1000000 loops, best of 3: 324 ns per loop

In [19]: tree = IntervalTree(sites)

In [20]: def bxtimes(tree, test_intervals, n_times):
   ....:      for i in xrange(n_times):
   ....:          a = sum(len(tree.find(ts[0], ts[1])) > 0 for ts in test_intervals)
   ....: 

In [21]: %timeit bxtimes(tree, test_sites, 10)
10 loops, best of 3: 108 ms per loop

As expected, bx-python is much faster, but InterLap does perform quite well at ~50K queries per second.

InterLap will come within 2X of bx-python if most of the query intervals are found in the database.