Working with BAM/CRAM/SAM-formatted files

Opening a file

To begin with, import the pysam module and open a pysam.AlignmentFile:

import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")

The above command opens the file ex1.bam for reading. The b qualifier indicates that this is a BAM file. To open a SAM file, type:

import pysam
samfile = pysam.AlignmentFile("ex1.sam", "r")

CRAM files are identified by a c qualifier:

import pysam
samfile = pysam.AlignmentFile("ex1.cram", "rc")

Fetching reads mapped to a region

Reads are obtained through a call to the pysam.AlignmentFile.fetch() method which returns an iterator. Each call to the iterator will returns a pysam.AlignedSegment object:

iter = samfile.fetch("seq1", 10, 20)
for x in iter:
    print(str(x))

pysam.AlignmentFile.fetch() returns all reads overlapping a region sorted by the first aligned base in the reference sequence. Note that it will also return reads that are only partially overlapping with the region. Thus the reads returned might span a region that is larger than the one queried.

Using the pileup-engine

In contrast to fetching, the pileup engine returns for each base in the reference sequence the reads that map to that particular position. In the typical view of reads stacking vertically on top of the reference sequence similar to a multiple alignment, fetching iterates over the rows of this implied multiple alignment while a pileup iterates over the columns.

Calling pileup() will return an iterator over each column (reference base) of a specified region. Each call to the iterator returns an object of the type pysam.PileupColumn that provides access to all the reads aligned to that particular reference position as well as some additional information:

iter = samfile.pileup('seq1', 10, 20)
for x in iter:
   print(str(x))

Creating BAM/CRAM/SAM files from scratch

The following example shows how a new BAM file is constructed from scratch. The important part here is that the pysam.AlignmentFile class needs to receive the sequence identifiers. These can be given either as a dictionary in a header structure, as lists of names and sizes, or from a template file. Here, we use a header dictionary:

header = { 'HD': {'VN': '1.0'},
            'SQ': [{'LN': 1575, 'SN': 'chr1'},
                   {'LN': 1584, 'SN': 'chr2'}] }

with pysam.AlignmentFile(tmpfilename, "wb", header=header) as outf:
    a = pysam.AlignedSegment()
    a.query_name = "read_28833_29006_6945"
    a.query_sequence="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
    a.flag = 99
    a.reference_id = 0
    a.reference_start = 32
    a.mapping_quality = 20
    a.cigar = ((0,10), (2,1), (0,25))
    a.next_reference_id = 0
    a.next_reference_start=199
    a.template_length=167
    a.query_qualities = pysam.qualitystring_to_array("<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<")
    a.tags = (("NM", 1),
              ("RG", "L1"))
    outf.write(a)

Using streams

Pysam does not support reading and writing from true python file objects, but it does support reading and writing from stdin and stdout. The following example reads from stdin and writes to stdout:

infile = pysam.AlignmentFile("-", "r")
outfile = pysam.AlignmentFile("-", "w", template=infile)
for s in infile:
    outfile.write(s)

It will also work with BAM files. The following script converts a BAM formatted file on stdin to a SAM formatted file on stdout:

infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "w", template=infile)
for s in infile:
    outfile.write(s)

Note that the file open mode needs to changed from r to rb.

Using samtools and bcftools commands within Python

Commands available in samtools and bcftools are available as simple function calls, with command line options provided as arguments. For example:

import pysam.samtools
pysam.samtools.sort("-o", "output.bam", "ex1.bam", catch_stdout=False)

import pysam.bcftools
pysam.bcftools.index("--csi", "ex2.vcf.gz")

corresponds to the command lines:

samtools sort -o output.bam ex1.bam
bcftools index --csi ex2.vcf.gz

Samtools commands are also imported into the main pysam namespace. For example:

pysam.sort("-m", "1000000", "-o", "output.bam", "ex1.bam", catch_stdout=False)

To make them valid Python identifiers, the functions cram_size() and fqimport() are spelt thus, differently from their corresponding commands.

In order to get usage information, try:

print(pysam.sort.usage())

Argument errors raise a pysam.SamtoolsError:

pysam.sort()

Traceback (most recent call last):
File "x.py", line 12, in <module>
  pysam.sort()
File "/build/lib.linux-x86_64-2.6/pysam/__init__.py", line 37, in __call__
  if retval: raise SamtoolsError( "\n".join( stderr ) )
pysam.SamtoolsError: 'Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>\n'

Messages from samtools on stderr are captured and are available using the get_messages() method:

pysam.sort.get_messages()

By default, pysam captures the samtools command’s standard output and returns it as the function’s return value. To redirect stdout to a file instead, either use the save_stdout keyword argument, or use "-o", "filename" in the arguments and also use catch_stdout=False to prevent pysam’s capturing from overriding your redirection. Finally, catch_stdout=False by itself discards standard output, which may help resolve problems in environments such as IPython notebooks:

# Return value
pileup_text = pysam.samtools.mpileup("in.bam")

# Save to file
pysam.samtools.mpileup("in.bam", save_stdout=pileup_filename)
pysam.samtools.mpileup("-o", pileup_filename, "in.bam", catch_stdout=False)

# Discard standard output
pysam.samtools.mpileup("in.bam", catch_stdout=False)  # Returns None

For each command available as a samtools subcommand, the following functions are provided:

pysam.samtools.command(args, *, catch_stdout=True, save_stdout=None, split_lines=False)
Parameters:
  • args – Arguments to be passed to the samtools subcommand.

  • catch_stdout (bool) – Whether to return stdout as the function’s value.

  • save_stdout (str) – Filename to which stdout should be written.

  • split_lines (bool) – Whether to split the return value into a list of lines.

Returns:

Standard output if it was caught, otherwise None.

If save_stdout is not None, the command’s standard ouput is written to the file specified and the function returns None.

Otherwise, if catch_stdout is true, the command’s standard output is captured and used as the function’s return value — either as a single str or as list[str] according to split_lines. If catch_stdout is false, the command’s standard output is discarded and the function returns None.

The command’s standard error is always captured and made available via get_messages().

pysam.samtools.command.get_messages()

Returns the standard error from the most recent invocation of the particular command, either as a single str or as list[str] according to split_lines as specified in that invocation.

pysam.samtools.command.usage()

Returns the command’s usage/help message, as a single str.

For each command available as a bcftools subcommand, the pysam.bcftools.command(), pysam.bcftools.command.get_messages(), and pysam.bcftools.command.usage() functions operate similarly.

Working with tabix-indexed files

To open a tabular file that has been indexed with tabix, use TabixFile:

import pysam
tbx = pysam.TabixFile("example.bed.gz")

Similar to fetch, intervals within a region can be retrieved by calling fetch():

for row in tbx.fetch("chr1", 1000, 2000):
     print(str(row))

This will return a tuple-like data structure in which columns can be retrieved by numeric index:

for row in tbx.fetch("chr1", 1000, 2000):
     print("chromosome is", row[0])

By providing a parser to fetch or TabixFile, the data will we presented in parsed form:

for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asTuple()):
     print("chromosome is", row.contig)
     print("first field (chrom)=", row[0])

Pre-built parsers are available for bed (asBed) formatted files and gtf (asGTF) formatted files. Thus, additional fields become available through named access, for example:

for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asBed()):
     print("name is", row.name)

Working with VCF/BCF formatted files

To iterate through a VCF/BCF formatted file use VariantFile:

from pysam import VariantFile

bcf_in = VariantFile("test.bcf")  # auto-detect input format
bcf_out = VariantFile('-', 'w', header=bcf_in.header)

for rec in bcf_in.fetch('chr1', 100000, 200000):
    bcf_out.write(rec)

_pysam.VariantFile.fetch() iterates over VariantRecord objects which provides access to simple variant attributes such as contig, pos, ref:

for rec in bcf_in.fetch():
    print(rec.pos)

but also to complex attributes such as the contents to the info, format and genotype columns. These complex attributes are views on the underlying htslib data structures and provide dictionary-like access to the data:

for rec in bcf_in.fetch():
    print(rec.info)
    print(rec.info.keys())
    print(rec.info["DP"])

The header attribute (VariantHeader) provides access information stored in the vcf header. The complete header can be printed:

>>> print(bcf_in.header)
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples
With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build
129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##contig=<ID=M>
##contig=<ID=17>
##contig=<ID=20>
##bcftools_viewVersion=1.3+htslib-1.3
##bcftools_viewCommand=view -O b -o example_vcf42.bcf
example_vcf42.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO   FORMAT    NA00001 NA00002 NA0000

Individual contents such as contigs, info fields, samples, formats can be retrieved as attributes from header:

>>> print(bcf_in.header.contigs)
<pysam.cbcf.VariantHeaderContigs object at 0xf250f8>

To convert these views to native python types, iterate through the views:

>>> print(list((bcf_in.header.contigs)))
['M', '17', '20']
>>> print(list((bcf_in.header.filters)))
['PASS', 'q10', 's50']
>>> print(list((bcf_in.header.info)))
['NS', 'DP', 'AF', 'AA', 'DB', 'H2']
>>> print(list((bcf_in.header.samples)))
['NA00001', 'NA00002', 'NA00003']

Alternatively, it is possible to iterate through all records in the header returning objects of type VariantHeaderRecord::

>>> for x in bcf_in.header.records:
>>>    print(x)
>>>    print(x.type, x.key)
GENERIC fileformat
FILTER FILTER
GENERIC fileDate
GENERIC source
GENERIC reference
GENERIC phasing
INFO INFO
INFO INFO
INFO INFO
INFO INFO
INFO INFO
INFO INFO
FILTER FILTER
FILTER FILTER
FORMAT FORMAT
FORMAT FORMAT
FORMAT FORMAT
FORMAT FORMAT
CONTIG contig
CONTIG contig
CONTIG contig
GENERIC bcftools_viewVersion
GENERIC bcftools_viewCommand

Extending pysam

Using pyximport, it is (relatively) straight-forward to access pysam internals and the underlying samtools library. An example is provided in the tests directory. The example emulates the samtools flagstat command and consists of three files:

  1. The main script pysam_flagstat.py. The important lines in this script are:

    import pyximport
    pyximport.install()
    import _pysam_flagstat
    
    ...
    
    flag_counts = _pysam_flagstat.count(pysam_in)
    

    The first part imports, sets up pyximport and imports the cython module _pysam_flagstat. The second part calls the count method in _pysam_flagstat.

  2. The cython implementation _pysam_flagstat.pyx. This script imports the pysam API via:

    from pysam.libcalignmentfile cimport AlignmentFile, AlignedSegment
    

    This statement imports, amongst others, AlignedSegment into the namespace. Speed can be gained from declaring variables. For example, to efficiently iterate over a file, an AlignedSegment object is declared:

    # loop over samfile
    cdef AlignedSegment read
    for read in samfile:
        ...
    
  3. A pyxbld providing pyximport with build information. Required are the locations of the samtools and pysam header libraries of a source installation of pysam plus the csamtools.so shared library. For example:

    def make_ext(modname, pyxfilename):
        from distutils.extension import Extension
        import pysam
        return Extension(name=modname,
              sources=[pyxfilename],
              extra_link_args=pysam.get_libraries(),
              include_dirs=pysam.get_include(),
              define_macros=pysam.get_defines())
    

If the script pysam_flagstat.py is called the first time, pyximport will compile the cython extension _pysam_flagstat.pyx and make it available to the script. Compilation requires a working compiler and cython installation. Each time _pysam_flagstat.pyx is modified, a new compilation will take place.

pyximport comes with cython.