crystal is a series python modules to evaluate modeling strategies for correlated data.
Practically, crystal is a python module to assign significance to clusters of correlated data. The most likely use-case is for DNA methylation data, but it can be use for any correlated data.
New users should check-out the introductory IPython notebook here.
Developers interested in evaluating their own method should view this notebook.
crystal models clusters of correlated data.
All functions to model clusters take:
So a single call would look like:
import pandas as pd
import aclust
import crystal
covs_df = pd.read_csv('covariates.csv')
cluster_iter = aclust.mclust(feature_gen(), max_dist=100)
crystal.zscore_cluster("methylation ~ disease + age + gender",
next(cluster_iter),
covs_df,
"disease")
Which will return a dictionary containing the p value, the coef and the t statistic for ‘disease’ status. However it is more likely that users will want a simple way to model every cluster:
for cluster in crystal.model_clusters(cluster_iter,
covs_df,
formula,
"disease",
crystal.zscore_cluster,
n_cpu=10):
print cluster
Here we run through an actual example using data in the repository. It would be simple to modify this script to run on any data.
from __future__ import print_function
import sys
import toolshed as ts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from aclust import mclust
import crystal
covariates_file = '../../../crystal/tests/covs.csv'
methylation_file = '../../../crystal/tests/meth.txt.gz'
formula = 'methylation ~ age + gender'
coef = 'gender'
covs = pd.read_csv(covariates_file)
def feature_gen(fname):
for i, d in enumerate(ts.reader(fname, header=False)):
if i == 0: continue
chrom, pos = d[0].split(":")
yield crystal.Feature(chrom, int(pos), crystal.logit(np.array(map(float, d[1:]))))
cluster_iter = mclust(feature_gen(methylation_file), max_dist=100, max_skip=0)
fmt = "{chrom}\t{start}\t{end}\t{p:.4g}\t{coef:.3f}\t{n_sites:d}"
print(ts.fmt2header(fmt))
for i, c in enumerate(crystal.model_clusters(cluster_iter,
covs, formula, coef,
model_fn=crystal.zscore_cluster,
n_cpu=1)):
print(fmt.format(**c))
if c['p'] < 1e-3 and abs(c['coef']) > 0.2 and c['n_sites'] > 3:
crystal.plot.spaghetti_plot(c, covs)
plt.savefig('/tmp/figure-1.eps')
break
(Source code, png, hires.png, pdf)
Note that we have this looks like a site where males generally have a lower level of methylation than females. The text output looks like this:
chrom start end coef n_sites
chrX 2733163 2745940 0.0002336 -0.514 3
chrX 2746332 2746420 0.006272 0.257 2
chrX 2746696 2746697 0.5873 -0.077 1
chrX 2747135 2747136 0.5996 0.054 1
chrX 2800456 2800457 0.6346 0.041 1
chrX 2822260 2822261 0.03331 0.168 1
chrX 2825269 2825270 0.0005556 -0.361 1
chrX 2825361 2825362 0.05747 -0.168 1
chrX 2825595 2825596 0.8458 -0.039 1
chrX 2826828 2835913 0.001088 -0.254 2
chrX 2836026 2836084 0.1921 0.148 2
chrX 2836113 2836114 0.8866 -0.024 1
chrX 2836299 2836300 0.001849 -0.288 1
chrX 2843195 2843196 0.01105 -0.173 1
chrX 2844680 2844681 0.07179 -0.272 1
chrX 2846195 2846196 0.5473 -0.075 1
chrX 2846892 2846893 0.13 -0.101 1
chrX 2847353 2847503 0.008624 -0.437 3
chrX 2847509 2847510 0.81 0.039 1
chrX 2847548 2852845 0.0003622 -0.490 6
crystal is developed at https://github.com/brentp/crystal.
It includes example data in the crystal/tests directory.
It has decent test coverage and tests can be run with ./test.sh from the root of the project directory.