crystal documentation

crystal is a series python modules to evaluate modeling strategies for correlated data.

Practically, crystal is a python module to assign significance to clusters of correlated data. The most likely use-case is for DNA methylation data, but it can be use for any correlated data.

New users should check-out the introductory IPython notebook here.

Developers interested in evaluating their own method should view this notebook.

Introduction

crystal models clusters of correlated data.

quick-start

All functions to model clusters take:

  1. formula: a string model formula in R (or patsy syntax).
  2. a methylation cluster from aclust
  3. a pandas datafame containing the covariates of interest. The samples in this dataframe should be in the same order as they are in the methylation cluster.
  4. a coefficient of interest from the model.

So a single call would look like:

import pandas as pd
import aclust
import crystal

covs_df = pd.read_csv('covariates.csv')
cluster_iter = aclust.mclust(feature_gen(), max_dist=100)
crystal.zscore_cluster("methylation ~ disease + age + gender",
                        next(cluster_iter),
                        covs_df,
                        "disease")

Which will return a dictionary containing the p value, the coef and the t statistic for ‘disease’ status. However it is more likely that users will want a simple way to model every cluster:

for cluster in crystal.model_clusters(cluster_iter,
                                      covs_df,
                                      formula,
                                      "disease",
                                      crystal.zscore_cluster,
                                      n_cpu=10):
    print cluster

Example

Here we run through an actual example using data in the repository. It would be simple to modify this script to run on any data.

from __future__ import print_function
import sys
import toolshed as ts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from aclust import mclust
import crystal

covariates_file = '../../../crystal/tests/covs.csv'
methylation_file = '../../../crystal/tests/meth.txt.gz'
formula = 'methylation ~ age + gender'
coef = 'gender'

covs = pd.read_csv(covariates_file)

def feature_gen(fname):
    for i, d in enumerate(ts.reader(fname, header=False)):
        if i == 0: continue
        chrom, pos = d[0].split(":")
        yield crystal.Feature(chrom, int(pos), crystal.logit(np.array(map(float, d[1:]))))


cluster_iter = mclust(feature_gen(methylation_file), max_dist=100, max_skip=0)


fmt = "{chrom}\t{start}\t{end}\t{p:.4g}\t{coef:.3f}\t{n_sites:d}"
print(ts.fmt2header(fmt))
for i, c in enumerate(crystal.model_clusters(cluster_iter,
                                             covs, formula, coef,
                                            model_fn=crystal.zscore_cluster,
                                            n_cpu=1)):
    print(fmt.format(**c))
    if c['p'] < 1e-3 and abs(c['coef']) > 0.2 and c['n_sites'] > 3:
        crystal.plot.spaghetti_plot(c, covs)
        plt.savefig('/tmp/figure-1.eps')
        break

(Source code, png, hires.png, pdf)

_images/ex-pipeline.png

Note that we have this looks like a site where males generally have a lower level of methylation than females. The text output looks like this:

chrom       start   end     coef    n_sites
chrX        2733163 2745940 0.0002336       -0.514  3
chrX        2746332 2746420 0.006272        0.257   2
chrX        2746696 2746697 0.5873  -0.077  1
chrX        2747135 2747136 0.5996  0.054   1
chrX        2800456 2800457 0.6346  0.041   1
chrX        2822260 2822261 0.03331 0.168   1
chrX        2825269 2825270 0.0005556       -0.361  1
chrX        2825361 2825362 0.05747 -0.168  1
chrX        2825595 2825596 0.8458  -0.039  1
chrX        2826828 2835913 0.001088        -0.254  2
chrX        2836026 2836084 0.1921  0.148   2
chrX        2836113 2836114 0.8866  -0.024  1
chrX        2836299 2836300 0.001849        -0.288  1
chrX        2843195 2843196 0.01105 -0.173  1
chrX        2844680 2844681 0.07179 -0.272  1
chrX        2846195 2846196 0.5473  -0.075  1
chrX        2846892 2846893 0.13    -0.101  1
chrX        2847353 2847503 0.008624        -0.437  3
chrX        2847509 2847510 0.81    0.039   1
chrX        2847548 2852845 0.0003622       -0.490  6

Development

crystal is developed at https://github.com/brentp/crystal.

It includes example data in the crystal/tests directory.

It has decent test coverage and tests can be run with ./test.sh from the root of the project directory.

Indices and tables