cyvcf2 API

class cyvcf2.cyvcf2.VCF(fname, mode='r', gts012=False, lazy=False, strict_gt=False, samples=None, threads=None)

VCF class holds methods to iterate over and query a VCF.

Parameters:
  • fname (str) – path to file

  • gts012 (bool) – if True, then gt_types will be 0=HOM_REF, 1=HET, 2=HOM_ALT, 3=UNKNOWN. If False, the last two are swapped: 2=UNKNOWN, 3=HOM_ALT.

  • lazy (bool) – if True, then don’t unpack (parse) the underlying record until needed.

  • strict_gt (bool) – if True, then any ‘.’ present in a genotype will classify the corresponding element in the gt_types array as UNKNOWN.

  • samples (list) – list of samples to extract from full set in file.

  • threads (int) – the number of threads to use including this reader.

Return type:

VCF object for iterating and querying.
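A minimal sketch of how a reader might be consumed. It assumes only what this page documents: iterating a VCF yields Variant records exposing CHROM, POS, REF and ALT. The helper name `collect_sites` and the stand-in records are hypothetical; with cyvcf2 installed, one would pass something like `VCF("calls.vcf.gz")` (path hypothetical) in their place.

```python
from itertools import islice
from types import SimpleNamespace


def collect_sites(reader, limit=None):
    """Return (CHROM, POS, REF, ALT) tuples from an iterable of Variant-like records.

    `reader` is assumed to behave like a cyvcf2 VCF object: iterating it
    yields records with CHROM, POS (1-based), REF and ALT (a list).
    """
    return [(v.CHROM, v.POS, v.REF, tuple(v.ALT)) for v in islice(reader, limit)]


# Stand-in records (not cyvcf2 objects) so the sketch runs without a file on disk.
demo_records = [
    SimpleNamespace(CHROM="chr1", POS=101, REF="A", ALT=["G"]),
    SimpleNamespace(CHROM="chr1", POS=250, REF="CT", ALT=["C", "CTT"]),
]
sites = collect_sites(demo_records, limit=1)
```

Because the helper depends only on iteration and attribute access, it can be exercised with stand-ins before being pointed at a real file.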

add_filter_to_header(self, adict)

Add a FILTER line to the VCF header.

Parameters:

adict (dict) – dict containing keys for ID, Description.

add_format_to_header(self, adict)

Add a FORMAT line to the VCF header.

Parameters:

adict (dict) – dict containing keys for ID, Number, Type, Description.

add_info_to_header(self, adict)

Add an INFO line to the VCF header.

Parameters:

adict (dict) – dict containing keys for ID, Number, Type, Description.

add_to_header(self, line)

Add a new line to the VCF header.

Parameters:

line (str) – full vcf header line.
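The dict-based methods and add_to_header describe the same information in two forms. The sketch below turns such a dict into a full header line. The key sets are taken from the parameter descriptions above and the `##KIND=<...>` layout follows the VCF specification; the helper name `header_line` is hypothetical, and whether cyvcf2 builds its lines this way internally is not stated here.

```python
def header_line(kind, adict):
    """Build a full header line such as '##INFO=<ID=...,...>' from a dict."""
    if kind == "FILTER":
        keys = ("ID", "Description")
    elif kind in ("INFO", "FORMAT"):
        keys = ("ID", "Number", "Type", "Description")
    else:
        raise ValueError("unsupported header kind: %r" % (kind,))

    missing = [k for k in keys if k not in adict]
    if missing:
        raise KeyError("missing header keys: %s" % ", ".join(missing))

    parts = []
    for k in keys:
        value = str(adict[k])
        if k == "Description":
            # The VCF specification requires Description to be double-quoted.
            value = '"%s"' % value
        parts.append("%s=%s" % (k, value))
    return "##%s=<%s>" % (kind, ",".join(parts))


depth_line = header_line(
    "INFO",
    {"ID": "DP", "Number": 1, "Type": "Integer", "Description": "Total depth"},
)
```

The resulting string is the kind of value add_to_header expects, while the dict itself is what add_info_to_header takes.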

contains()

Check if the given ID is in the header.

gen_variants(self, sites, offset=0, each=1, call_rate=0.8)

get_header_type(self, key, order=[BCF_HL_INFO, BCF_HL_FMT])

Extract a field from the VCF header by id.

Parameters:

key (str) – ID to pull from the header.

Returns:

rec – dictionary containing header information.

Return type:

dict

header_iter(self)

Iterate over the fields in the header.

raw_header

string of the raw header from the VCF

relatedness(self, int n_variants=35000, int gap=30000, float min_af=0.04, float max_af=0.8, float linkage_max=0.2, min_depth=8)

samples

list of samples pulled from the VCF.

seqlens

list of chromosome lengths, if defined in the VCF header

seqnames

list of chromosomes in the VCF

set_index(self, index_path='')

set_samples(self, samples)

Set the samples to be pulled from the VCF; this must be called before any iteration.

Parameters:

samples (list) – list of samples to extract.

class cyvcf2.cyvcf2.Variant(*args, **kwargs)

Variant represents a single VCF Record.

It is created internally by iterating over a VCF.

INFO

a dictionary-like field that provides access to the VCF INFO field.

Type:

INFO

POS

the 1-based variant start.

ALT

the list of alternate alleles.

CHROM

Chromosome of the variant.

FILTER

the value of FILTER from the VCF field.

A value of PASS or ‘.’ in the VCF gives None for this property.

FILTERS

the FILTER values as a list from the VCF field.

A value of ‘.’ in the VCF returns an empty list for this property.
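Since FILTER is None for both PASS and ‘.’, a passing-only filter reduces to an identity check. The sketch below assumes nothing beyond that documented behaviour; the helper name `passing` and the stand-in records (including the filter name "q10") are hypothetical.

```python
from types import SimpleNamespace


def passing(records):
    """Yield only records whose FILTER is None (PASS or '.' in the file)."""
    return (v for v in records if v.FILTER is None)


# Stand-in records (not cyvcf2 objects); "q10" is a made-up filter name.
demo_records = [
    SimpleNamespace(POS=101, FILTER=None),
    SimpleNamespace(POS=250, FILTER="q10"),
    SimpleNamespace(POS=300, FILTER=None),
]
passing_positions = [v.POS for v in passing(demo_records)]
```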

FORMAT

VCF FORMAT field for this variant.

ID

the value of ID from the VCF field.

QUAL

the float value of QUAL from the VCF field.

REF

the reference allele.

aaf

alternate allele frequency across samples in this VCF.

call_rate

proportion of samples that were not UNKNOWN.

end

end of the variant. For SVs, the INFO field is parsed to determine it.

format(self, field, vtype=None)

format returns a numpy array for the requested field.

The numpy array shape will match the requested field. E.g. if the field has Number=3, then the shape will be (n_samples, 3).

Parameters:

field (str) – FORMAT field to get the values.

Return type:

numpy array.

genotypes

genotypes returns a list for each sample indicating the alleles and phasing.

e.g. [0, 1, True] corresponds to 0|1 while [1, 2, False] corresponds to 1/2
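The per-sample list layout described above (allele indices followed by a phasing flag) can be turned back into the usual VCF text form. This is a sketch only: the helper name `genotype_to_string` is hypothetical, and treating a negative allele index as a missing allele is an assumption that this page does not confirm.

```python
def genotype_to_string(gt):
    """Render one entry of `genotypes`, e.g. [0, 1, True], as '0|1'."""
    *alleles, phased = gt
    sep = "|" if phased else "/"
    # Assumption: a negative allele index denotes a missing allele ('.').
    return sep.join("." if a < 0 else str(a) for a in alleles)


phased_het = genotype_to_string([0, 1, True])
unphased = genotype_to_string([1, 2, False])
```

Unpacking with `*alleles, phased` keeps the helper independent of ploidy, since only the last element is taken to be the phasing flag.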

gt_alt_depths

get the count of alternate reads as a numpy array.

gt_alt_freqs

get the freq of alternate reads as a numpy array.

gt_bases

numpy array indicating the alleles in each sample.

gt_depths

get the read-depth for each sample as a numpy array.

gt_phases

get a boolean indicating whether each sample is phased as a numpy array.

gt_phred_ll_het

get the PL of het for each sample as a numpy array.

gt_phred_ll_homalt

get the PL of hom_alt for each sample as a numpy array.

gt_phred_ll_homref

get the PL of Hom ref for each sample as a numpy array.

gt_quals

get the GQ for each sample as a numpy array.

gt_ref_depths

get the count of reference reads as a numpy array.

gt_types

gt_types returns a numpy array indicating the type of each sample.

HOM_REF=0, HET=1. With gts012=True, HOM_ALT=2 and UNKNOWN=3; otherwise UNKNOWN=2 and HOM_ALT=3.
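Because the meaning of codes 2 and 3 depends on how the VCF was opened, it can help to decode gt_types through a lookup keyed on the gts012 setting. The following sketch uses the two encodings documented for the gts012 parameter; the helper name `gt_type_names` is hypothetical, and the input is any sequence of integer codes (a plain list here, a numpy array in practice).

```python
def gt_type_names(gt_types, gts012=False):
    """Map gt_types codes to names for the given gts012 setting."""
    if gts012:
        names = {0: "HOM_REF", 1: "HET", 2: "HOM_ALT", 3: "UNKNOWN"}
    else:
        names = {0: "HOM_REF", 1: "HET", 2: "UNKNOWN", 3: "HOM_ALT"}
    return [names[int(t)] for t in gt_types]


default_names = gt_type_names([0, 1, 2, 3])
gts012_names = gt_type_names([0, 1, 2, 3], gts012=True)
```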

is_deletion

boolean indicating if the variant is a deletion.

is_indel

boolean indicating if the variant is an indel.

is_mnp

boolean indicating if the variant is a MNP.

is_snp

boolean indicating if the variant is a SNP.

is_sv

boolean indicating if the variant is an SV.

is_transition

boolean indicating if the variant is a transition.

num_called

number of samples that were not UNKNOWN.

num_het

number of heterozygous samples at this variant.

num_hom_alt

number of homozygous alternate samples at this variant.

num_hom_ref

number of homozygous reference samples at this variant.

num_unknown

number of unknown samples at this variant.
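The counting properties and call_rate all follow from the gt_types codes. The sketch below reimplements those definitions in plain Python to make the relationships explicit; it is not cyvcf2's own implementation, and the helper name `genotype_summary` is hypothetical.

```python
from collections import Counter


def genotype_summary(gt_types, gts012=False):
    """Recompute the per-variant genotype counts from gt_types codes."""
    # Codes for UNKNOWN and HOM_ALT depend on the gts012 setting.
    unknown, hom_alt = (3, 2) if gts012 else (2, 3)
    counts = Counter(int(t) for t in gt_types)
    n_samples = sum(counts.values())
    num_unknown = counts[unknown]
    num_called = n_samples - num_unknown
    return {
        "num_hom_ref": counts[0],
        "num_het": counts[1],
        "num_hom_alt": counts[hom_alt],
        "num_unknown": num_unknown,
        "num_called": num_called,
        # Proportion of samples that were not UNKNOWN.
        "call_rate": num_called / n_samples if n_samples else 0.0,
    }


summary = genotype_summary([0, 1, 3, 2])
```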

ploidy

get the ploidy of each sample for the given record.

set_format(self, name, ndarray data)

set the format field given by name. data must be a numpy array of type float, int or string (fixed length ASCII np.bytes_)

set_pos(self, int pos0)

set the POS to the given 0-based position

start

0-based start of the variant.

var_type

type of variant (snp/indel/sv)

class cyvcf2.cyvcf2.Writer(fname, VCF tmpl, mode=None)

Writer class makes a VCF Writer.

Parameters:
  • fname (str) – path to file

  • tmpl (VCF) – a template to use to create the output header.

  • mode (str) –

    Mode to use for writing the file. If None (default) is given, the mode is inferred from the filename extension. If stdout ("-") is provided for fname and mode is left at default, uncompressed VCF will be produced.
    Valid values are:
    - "wbu": uncompressed BCF
    - "wb": compressed BCF
    - "wz": compressed VCF
    - "w": uncompressed VCF
    Compression level can also be indicated by adding a single integer to one of the compressed modes (e.g. "wz4" for VCF with compression level 4).

Note

File extensions .bcf and .bcf.gz will both return compressed BCF. If you want uncompressed BCF you must explicitly provide the appropriate mode.
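The mode rules above can be summarised as a small decision function. This is an illustration of the documented behaviour, not the library's code: the helper name `infer_write_mode` is hypothetical, and both the `.vcf.gz` mapping to "wz" and the case-insensitive matching are assumptions not stated on this page.

```python
def infer_write_mode(fname, mode=None):
    """Sketch of the documented mode inference for a Writer."""
    if mode is not None:
        # An explicit mode always wins, e.g. "wbu" for uncompressed BCF.
        return mode
    if fname == "-":
        # stdout with the default mode gives uncompressed VCF.
        return "w"
    name = fname.lower()
    if name.endswith((".bcf", ".bcf.gz")):
        # Both extensions give compressed BCF.
        return "wb"
    if name.endswith(".vcf.gz"):
        # Assumption: a gzip-style VCF extension selects compressed VCF.
        return "wz"
    return "w"


bcf_mode = infer_write_mode("out.bcf")
forced_mode = infer_write_mode("out.bcf", mode="wbu")
```

The second call shows the case the note warns about: uncompressed BCF is only produced when the mode is given explicitly.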

Return type:

Writer object for writing VCF records.

close(self)

from_string(type cls, fname, header_string, mode=u'w')

name

Type:

bytes

variant_from_string(self, variant_string)

write_header(self)

write_record(self, Variant var)

Write the variant to the writer.
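A common use of write_record is copying a filtered subset of records to a new file. The sketch below relies only on the write_record and close methods listed above; the helper name `write_filtered`, the `ListWriter` stand-in, and the QUAL threshold are hypothetical.

```python
from types import SimpleNamespace


def write_filtered(reader, writer, keep):
    """Write records for which keep(record) is true; return how many were written."""
    written = 0
    try:
        for v in reader:
            if keep(v):
                writer.write_record(v)
                written += 1
    finally:
        # Close the writer even if iteration fails part-way through.
        writer.close()
    return written


class ListWriter:
    """Stand-in for a Writer (not a cyvcf2 class) that collects records in memory."""

    def __init__(self):
        self.records = []
        self.closed = False

    def write_record(self, var):
        self.records.append(var)

    def close(self):
        self.closed = True


demo_records = [SimpleNamespace(QUAL=q) for q in (12.0, 45.5, None, 30.0)]
demo_writer = ListWriter()
n_written = write_filtered(
    demo_records,
    demo_writer,
    keep=lambda v: v.QUAL is not None and v.QUAL >= 30,
)
```

With cyvcf2, `writer` would be a `Writer` constructed from an output path and the input VCF as the template, as in the signature above.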

class cyvcf2.cyvcf2.INFO

INFO is created internally by accessing Variant.INFO.

It acts like a dictionary where keys are expected to be in the INFO field of the Variant and values are typed according to what is specified in the VCF header.

Items can be deleted with del v.INFO[key] and accessed with v.INFO[key] or v.INFO.get(key).
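The access patterns above are enough to write helpers that work on any dictionary-like INFO object. This sketch uses only `get(key)` and `del info[key]`; the helper name `drop_info_keys` is hypothetical, a plain dict stands in for `v.INFO`, and it assumes `get` returns None for an absent key.

```python
def drop_info_keys(info, keys):
    """Delete the given keys from an INFO-like mapping; return those removed."""
    removed = []
    for key in keys:
        # Assumption: get(key) returns None when the key is absent.
        if info.get(key) is not None:
            del info[key]
            removed.append(key)
    return removed


# A plain dict stands in for v.INFO here.
demo_info = {"DP": 35, "AF": 0.5}
removed = drop_info_keys(demo_info, ["DP", "MQ"])
```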