cyvcf2 API¶
- class cyvcf2.cyvcf2.VCF(fname, mode='r', gts012=False, lazy=False, strict_gt=False, samples=None, threads=None)¶
VCF class holds methods to iterate over and query a VCF.
- Parameters:
fname (str) – path to file
gts012 (bool) – if True, then gt_types will be 0=HOM_REF, 1=HET, 2=HOM_ALT, 3=UNKNOWN. If False, the last two codes are swapped: 2=UNKNOWN, 3=HOM_ALT.
lazy (bool) – if True, then don’t unpack (parse) the underlying record until needed.
strict_gt (bool) – if True, then any ‘.’ present in a genotype will classify the corresponding element in the gt_types array as UNKNOWN.
samples (list) – list of samples to extract from full set in file.
threads (int) – the number of threads to use including this reader.
- Return type:
VCF object for iterating and querying.
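A minimal sketch of iterating over a VCF opened with gts012=True and tallying heterozygous calls per sample. The helper names het_counts and het_counts_from_file are hypothetical (not part of cyvcf2), and the cyvcf2 import is deferred so the counting logic can be exercised on any objects that expose gt_types.

```python
HET = 1  # gt_types code for a heterozygous call (the same under either gts012 setting)


def het_counts(variants, n_samples):
    """Count HET calls per sample over any iterable of objects exposing gt_types."""
    counts = [0] * n_samples
    for v in variants:
        for i, gt_type in enumerate(v.gt_types):
            if gt_type == HET:
                counts[i] += 1
    return counts


def het_counts_from_file(path):
    """Open a VCF with cyvcf2 and count HET calls per sample (hypothetical helper)."""
    from cyvcf2 import VCF  # deferred import: only needed when reading a real file

    vcf = VCF(path, gts012=True)
    return het_counts(vcf, len(vcf.samples))
```

Keeping the file handling separate from the counting loop is only a convenience for testing; in a script the two functions could be merged.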
- add_filter_to_header(self, adict)¶
Add a FILTER line to the VCF header.
- Parameters:
adict (dict) – dict containing keys for ID, Description.
- add_format_to_header(self, adict)¶
Add a FORMAT line to the VCF header.
- Parameters:
adict (dict) – dict containing keys for ID, Number, Type, Description.
- add_info_to_header(self, adict)¶
Add an INFO line to the VCF header.
- Parameters:
adict (dict) – dict containing keys for ID, Number, Type, Description.
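As an illustration of the dict shape described above, the following sketch builds an INFO header entry and registers it. The helper names, the field ID MY_SCORE and its description are made up for the example, and passing Number as a string is an assumption rather than a documented requirement.

```python
def info_header(id_, number, type_, description):
    """Build a dict with the four keys named above: ID, Number, Type, Description."""
    # Assumption: values are passed as strings, so Number is converted with str().
    return {
        "ID": id_,
        "Number": str(number),
        "Type": type_,
        "Description": description,
    }


def annotate_header(vcf):
    """Register a hypothetical INFO field on any object with add_info_to_header."""
    vcf.add_info_to_header(
        info_header("MY_SCORE", 1, "Float", "Example score (made up)")
    )
```

The same dict shape applies to add_format_to_header, while add_filter_to_header only needs ID and Description.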
- add_to_header(self, line)¶
Add a new line to the VCF header.
- Parameters:
line (str) – full vcf header line.
- contains()¶
Check if the given ID is in the header.
- gen_variants(self, sites, offset=0, each=1, call_rate=0.8)¶
- get_header_type(self, key, order=[BCF_HL_INFO, BCF_HL_FMT])¶
Extract a field from the VCF header by id.
- Parameters:
key (str) – ID to pull from the header.
- Returns:
rec – dictionary containing header information.
- Return type:
dict
- header_iter(self)¶
Iterate over fields in the header.
- raw_header¶
string of the raw header from the VCF
- samples¶
list of samples pulled from the VCF.
- seqlens¶
list of chromosome lengths, if defined in the VCF header
- seqnames¶
list of chromosomes in the VCF
- set_index(self, index_path='')¶
- set_samples(self, samples)¶
Set the samples to be pulled from the VCF; this must be called before any iteration.
- Parameters:
samples (list) – list of samples to extract.
- class cyvcf2.cyvcf2.Variant(*args, **kwargs)¶
Variant represents a single VCF Record.
It is created internally by iterating over a VCF.
- INFO¶
a dictionary-like field that provides access to the VCF INFO field.
- Type:
INFO
- POS¶
the 1-based variant start.
- ALT¶
the list of alternate alleles.
- CHROM¶
Chromosome of the variant.
- FILTER¶
the value of FILTER from the VCF field.
a value of PASS or ‘.’ in the VCF will give None for this property.
- FILTERS¶
the FILTER values as a list from the VCF field.
a value of ‘.’ in the VCF will return an empty list for this property.
- FORMAT¶
VCF FORMAT field for this variant.
- ID¶
the value of ID from the VCF field.
- QUAL¶
the float value of QUAL from the VCF field.
- REF¶
the reference allele.
- aaf¶
alternate allele frequency across samples in this VCF.
- call_rate¶
proportion of samples that were not UNKNOWN.
- end¶
end of the variant. For SVs, the INFO field is parsed to determine the end.
- format(self, field, vtype=None)¶
format returns a numpy array for the requested field.
The numpy array shape will match the requested field. E.g. if the field has Number=3, then the shape will be (n_samples, 3).
- Parameters:
field (str) – FORMAT field to get the values.
- Return type:
numpy array.
- genotypes¶
genotypes returns a list for each sample indicating the alleles and phasing.
e.g. [0, 1, True] corresponds to 0|1 while [1, 2, False] corresponds to 1/2
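The list layout above can be turned back into VCF-style genotype strings with a small helper. This is a sketch only: genotype_string is a hypothetical name, and rendering negative allele values as "." rests on the assumption that missing alleles are reported as negative integers.

```python
def genotype_string(gt):
    """Render one entry of Variant.genotypes, e.g. [0, 1, True] -> '0|1'."""
    *alleles, phased = gt  # the last element is the phasing flag
    sep = "|" if phased else "/"
    # Assumption: a missing allele is reported as a negative integer.
    return sep.join("." if a < 0 else str(a) for a in alleles)
```

For a variant v, `[genotype_string(gt) for gt in v.genotypes]` would give one string per sample.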
- gt_alt_depths¶
get the count of alternate reads as a numpy array.
- gt_alt_freqs¶
get the frequency of alternate reads as a numpy array.
- gt_bases¶
numpy array indicating the alleles in each sample.
- gt_depths¶
get the read-depth for each sample as a numpy array.
- gt_phases¶
get a boolean indicating whether each sample is phased as a numpy array.
- gt_phred_ll_het¶
get the PL of het for each sample as a numpy array.
- gt_phred_ll_homalt¶
get the PL of hom_alt for each sample as a numpy array.
- gt_phred_ll_homref¶
get the PL of hom_ref for each sample as a numpy array.
- gt_quals¶
get the GQ for each sample as a numpy array.
- gt_ref_depths¶
get the count of reference reads as a numpy array.
- gt_types¶
gt_types returns a numpy array indicating the type of each sample.
HOM_REF=0, HET=1. With gts012=True, HOM_ALT=2 and UNKNOWN=3; otherwise UNKNOWN=2 and HOM_ALT=3.
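Because the meaning of codes 2 and 3 depends on how the VCF was opened, it can help to decode gt_types explicitly. The sketch below follows the mapping given for the gts012 parameter of VCF; the function names are hypothetical.

```python
def gt_type_names(gts012=False):
    """Return the code -> name mapping for gt_types under the given gts012 setting."""
    names = {0: "HOM_REF", 1: "HET"}
    if gts012:
        names.update({2: "HOM_ALT", 3: "UNKNOWN"})
    else:
        # Default encoding: codes 2 and 3 are swapped relative to gts012=True.
        names.update({2: "UNKNOWN", 3: "HOM_ALT"})
    return names


def decode_gt_types(gt_types, gts012=False):
    """Translate an array or list of gt_types codes into readable names."""
    names = gt_type_names(gts012)
    return [names[int(t)] for t in gt_types]
```

The gts012 argument passed here must match the one used when constructing the VCF, otherwise HOM_ALT and UNKNOWN will be mislabeled.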
- is_deletion¶
boolean indicating if the variant is a deletion.
- is_indel¶
boolean indicating if the variant is an indel.
- is_mnp¶
boolean indicating if the variant is a MNP.
- is_snp¶
boolean indicating if the variant is a SNP.
- is_sv¶
boolean indicating if the variant is an SV.
- is_transition¶
boolean indicating if the variant is a transition.
- num_called¶
number of samples that were not UNKNOWN.
- num_het¶
number of heterozygous samples at this variant.
- num_hom_alt¶
number of homozygous alternate samples at this variant.
- num_hom_ref¶
number of homozygous reference samples at this variant.
- num_unknown¶
number of unknown samples at this variant.
- ploidy¶
get the ploidy of each sample for the given record.
- set_format(self, name, ndarray data)¶
set the format field given by name. data must be a numpy array of type float, int or string (fixed-length ASCII np.bytes_).
- set_pos(self, int pos0)¶
set the POS to the given 0-based position
- start¶
0-based start of the variant.
- var_type¶
type of variant (snp/indel/sv)
- class cyvcf2.cyvcf2.Writer(fname, VCF tmpl, mode=None)¶
Writer class makes a VCF Writer.
- Parameters:
fname (str) – path to file
tmpl (VCF) – a template to use to create the output header.
mode (str) – Mode to use for writing the file. If None (default) is given, the mode is inferred from the filename extension. If stdout ("-") is provided for fname and mode is left at default, uncompressed VCF will be produced. Valid values are: "wbu" (uncompressed BCF), "wb" (compressed BCF), "wz" (compressed VCF), "w" (uncompressed VCF). Compression level can also be indicated by adding a single integer to one of the compressed modes (e.g. "wz4" for VCF with compression level 4).
Note: File extensions .bcf and .bcf.gz will both return compressed BCF. If you want uncompressed BCF you must explicitly provide the appropriate mode.
- Return type:
Writer object for writing variants.
- close(self)¶
- from_string(type cls, fname, header_string, mode=u'w')¶
- name¶
- Type:
bytes
- variant_from_string(self, variant_string)¶
- write_header(self)¶
- write_record(self, Variant var)¶
Write the variant to the writer.
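A common use of Writer is copying a subset of records from a reader that also serves as the header template. The sketch below keeps records whose FILTER is None (PASS or ‘.’); copy_passing and filter_file are hypothetical helper names, and the cyvcf2 import is deferred so the copy loop can be tried with stand-in objects.

```python
def copy_passing(reader, writer):
    """Write every record whose FILTER is None to writer; return the number written."""
    n_written = 0
    try:
        for variant in reader:
            if variant.FILTER is None:  # None means PASS or '.' in the VCF
                writer.write_record(variant)
                n_written += 1
    finally:
        writer.close()  # close even if iteration fails part-way through
    return n_written


def filter_file(in_path, out_path):
    """Copy passing records from in_path to out_path (hypothetical helper)."""
    from cyvcf2 import VCF, Writer  # deferred import: only needed for real files

    reader = VCF(in_path)
    # mode is left at its default, so it is inferred from the out_path extension.
    writer = Writer(out_path, reader)
    return copy_passing(reader, writer)
```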
- class cyvcf2.cyvcf2.INFO¶
INFO is created internally by accessing Variant.INFO.
It acts like a dictionary where keys are expected to be in the INFO field of the Variant and values are typed according to what is specified in the VCF header.
Items can be deleted with del v.INFO[key] and accessed with v.INFO[key] or v.INFO.get(key).
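The access and deletion patterns above can be combined to strip unwanted keys from a record before writing it. This is a sketch under the assumption that get returns None for an absent key, as a dict would; drop_info_keys is a hypothetical helper.

```python
def drop_info_keys(variant, keys):
    """Delete the given keys from variant.INFO and return the keys actually removed."""
    removed = []
    for key in keys:
        # Assumption: get() returns None when the key is absent from this record.
        if variant.INFO.get(key) is not None:
            del variant.INFO[key]
            removed.append(key)
    return removed
```

Checking with get before deleting avoids relying on how a missing key is reported by del.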