Benchmarking ============ Here we will benchmark `InterLap` **TLDR** `InterLap` is: - 40 times faster than naive method with no memory overhead - 20 times slower than the bx-python interval tree written in cython Create 50K random intervals of random lengths between 2KB and 20KB across 100MB of chromosome: .. ipython:: In [1]: import random, sys In [1]: from interlap import InterLap In [2]: random.seed(42) In [3]: sites = random.sample(range(22, 100000000, 12), 50000) In [4]: sites = [(s, s + random.randint(2000, 20000)) for s in sites] In [4]: %timeit InterLap(sites) In [4]: inter = InterLap(sites) In [5]: len(inter) Now we have the searchable object `inter` with 50K intervals. Since `InterLap` is a sorted list, time to create is simply time to create a sorted list. We can time to see how fast we can test every site in the tree .. ipython:: In [6]: %timeit assert all(r in inter for r in sites) And make sure we dont see intervals at positions < 0: .. ipython:: In [7]: %timeit assert not any((-r[0], -r[1]) in inter for r in sites) Now do 100K (10K * 10) queries. .. ipython:: In [8]: def times(inter, test_intervals, n_times): ...: for i in xrange(n_times): ...: a = sum(t in inter for t in test_intervals) ...: In [9]: test_sites = random.sample(range(21, 100000000, 12), 10000) In [10]: test_sites = [(t, t + random.randint(2000, 200000)) for t in test_sites] In [11]: %timeit times(inter, test_sites, 10) Compared to Naive Method ------------------------ We can compare to the naive method (only 1 iteration since it will be slow): .. ipython:: In [1]: def naivetimes(query_intervals, test_intervals): ...: for t in test_intervals: ...: isin = any(q[1] > t[0] and q[0] < t[1] for q in query_intervals) ...: In [2]: %timeit naivetimes(sites, test_sites) So `InterLap` is over 40 times (remember we did only 1 rep here instead of 10) as fast as the naive method with *no memory overhead*. Compared to bx-python interval-tree ----------------------------------- We can compare this to bx-python which uses a tree structure written in cython. .. ipython:: In [1]: from bx.intervals import IntervalTree In [2]: %timeit IntervalTree(sites) In [3]: tree = IntervalTree(sites) In [4]: def bxtimes(tree, test_intervals, n_times): ...: for i in xrange(n_times): ...: a = sum(len(tree.find(ts[0], ts[1])) > 0 for ts in test_intervals) ...: In [5]: %timeit bxtimes(tree, test_sites, 10) As expected, bx-python is much faster, but `InterLap` does perform quite well at ~50K queries per second. `InterLap` will come within 2X of bx-python if most of the query intervals are found in the database.