Modifying Existing VCFs

cyvcf2 is optimized for fast reading and extraction from existing files. However, it also offers some means of modifying existing VCFs.

Modifying the INFO field

Here, we show an example of how to modify the INFO field at a locus to annotate variants with the genes that they overlap.

from cyvcf2 import VCF, Writer
vcf = VCF(VCF_PATH)
# adjust the header to contain the new field
# the keys 'ID', 'Description', 'Type', and 'Number' are required.
vcf.add_info_to_header({'ID': 'gene', 'Description': 'overlapping gene',
    'Type':'Character', 'Number': '1'})

# create a new vcf Writer using the input vcf as a template.
fname = "out.vcf"
w = Writer(fname, vcf)

for v in vcf:
    # The get_gene_intersections function is not shown.
    # This could be any operation to find intersections
    # or any manipulation required by the user.
    genes = get_gene_intersections(v)
    if genes is not None:
        v.INFO["gene"] = ",".join(genes)
    w.write_record(v)

w.close(); vcf.close()

Modifying genotypes and the FORMAT field

Here is an example where we filter some calls from records in a VCF. This demonstrates how to add to the FORMAT field and how to modify genotypes at a locus.

from cyvcf2 import VCF, Writer
vcf = VCF(VCF_PATH)
# adjust the header to contain the new field
# the keys 'ID', 'Description', 'Type', and 'Number' are required.
vcf.add_format_to_header({
    'ID': 'FILTER_CODE',
    'Description': 'Numeric code for filtering reason',
    'Type': 'Integer',
    'Number': '1'
})

# create a new vcf Writer using the input vcf as a template.
fname = "out.vcf"
w = Writer(fname, vcf)

for v in vcf:
    # The filter_samples function is not shown.
    # This could be any manipulation required by the user.
    # Since we specified the FILTER_CODE format field is an Integer
    # with Number=1, reasons must be an n x 1 numpy array of integers
    # where n is the number of samples.
    indicies, reasons = filter_samples(v)
    if indicies:
        # add the reasons array to the format dictionary at this locus
        v.set_format('FILTER_CODE', reasons)
        for index in indicies:
            # overwrite the genotypes of each filtered locus to be nocalls
            # Note: until the reassignment to v.genotypes below, this
            # leaves the v Variant object in an inconsistent state
            v.genotypes[index] = [-1]*v.ploidy + [False]
        # it is necessary to reassign the genotypes field
        # so that the v Variant object reprocess it and future calls
        # to functions like v.genotype.array() properly reflect
        # the changed genotypes
        v.genotypes = v.genotypes

    w.write_record(v)

w.close(); vcf.close()

Known Limitations with Writing VCFs

  • cyvcf2 currently does not support writing VCFs encoded with UTF-8 with non-ASCII characters in the contents of string-typed FORMAT fields.

  • cyvcf2 currently does not support writing string type format fields with number>1.