Regions module

This module provides functions and classes to work with genomic regions (also referred to as genomic intervals).

Its main classes are:

  • GenomicRegion: A class that represents a genomic region/interval

  • RegionBased: The base class for collections of genomic regions

The aim of this module, besides providing an intuitive set of tools operating on genomic regions and collections thereof, is to supply a unified interface for the different representations of genomic region sets. Specifically, it gives the user access to the same methods with identical syntax regardless of what type of genomic regions file the user currently works with (BED, GFF, BigWig, Tabix, …).

Most of the time, it is enough to open a file with load(); the module will figure out the underlying file type automatically. Please refer to the documentation for further details.

class genomic_regions.regions.GenomicRegion(chromosome=None, start=None, end=None, strand=None, ix=None, **kwargs)

Class representing a genomic region.

chromosome

Name of the chromosome this region is located on

start

Start position of the region in base pairs

end

End position of the region in base pairs

strand

Strand this region is on. Can be a str (‘+’, ‘-’, ‘.’), None, or an int (+1, -1)

ix

Index of the region in the context of a set of genomic regions.

as_bed_line(score_field='score', name_field='name')

Return a representation of this object as a line in a BED file.

Parameters
  • score_field – name of the attribute to be used in the ‘score’ field of the BED line

  • name_field – name of the attribute to be used in the ‘name’ field of the BED line

Returns

str

as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')

Return a representation of this object as a line in a GFF file.

Parameters
  • source_field – name of the attribute to be used in the ‘source’ field of the GFF line

  • feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line

  • score_field – name of the attribute to be used in the ‘score’ field of the GFF line

  • frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line

  • float_format – Formatting string for the float fields

Returns

str

property attributes

Return all visible attributes of this GenomicRegion.

Returns all attribute names that do not start with an underscore.

Returns

list of attribute names

property center

Return the center coordinate of the GenomicRegion.

Returns

float

contains(region)

Check if the specified region is completely contained in this region.

Parameters

region – GenomicRegion object or string

copy()

Return a (shallow) copy of this GenomicRegion

Returns

GenomicRegion

expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)

Expand this region by a relative or an absolute amount.

Parameters
  • absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>

  • relative – Relative amount as a fraction of the region length by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*len(self)>

  • absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side

  • absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side

  • relative_left – Relative amount as a fraction of the region length by which to expand the region represented by this GenomicRegion object on the left side

  • relative_right – Relative amount as a fraction of the region length by which to expand the region represented by this GenomicRegion object on the right side

  • copy – If True, return a modified copy of the original region; if False, modify the existing region in place

  • from_center – If True measures distance from center rather than start and end of the old region

Returns

GenomicRegion
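The arithmetic described for the absolute and relative parameters can be sketched on plain integers. This is an illustrative helper, not the library's implementation: the function name expand_interval is hypothetical, the region length is assumed to be end - start + 1 (1-based, inclusive coordinates), and fractional results are assumed to be truncated to whole base pairs.

```python
def expand_interval(start, end,
                    absolute_left=0, absolute_right=0,
                    relative_left=0.0, relative_right=0.0):
    """Return (new_start, new_end) after expanding an interval.

    Hypothetical helper; assumes 1-based, inclusive coordinates,
    so the region length is end - start + 1.
    """
    length = end - start + 1
    # Absolute amounts are in base pairs, relative amounts are
    # fractions of the region length (truncated to an integer here).
    new_start = start - absolute_left - int(relative_left * length)
    new_end = end + absolute_right + int(relative_right * length)
    return new_start, new_end


# Expanding by 100 bp on both sides
print(expand_interval(1001, 2000, absolute_left=100, absolute_right=100))
# Expanding by half the region length on both sides
print(expand_interval(1001, 2000, relative_left=0.5, relative_right=0.5))
```

Passing the same value for the left and right arguments corresponds to the symmetric absolute and relative parameters documented above.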

property five_prime

Return the position of the 5’ end of this GenomicRegion on the reference.

Returns

int

fix_chromosome(copy=False)

Change chromosome representation from chr<NN> to <NN> or vice versa.

Parameters

copy – If True, make a copy of the region; otherwise modify the existing region in place.

Returns

GenomicRegion
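The name conversion itself amounts to adding or stripping a prefix. A minimal sketch on plain strings, assuming the prefix is exactly the lowercase string "chr" (the helper name toggle_chr_prefix is hypothetical and not part of the module):

```python
def toggle_chr_prefix(chromosome):
    """Switch between 'chr<NN>' and '<NN>' chromosome names.

    Hypothetical helper; assumes the prefix is exactly 'chr'.
    """
    if chromosome.startswith("chr"):
        return chromosome[3:]
    return "chr" + chromosome


print(toggle_chr_prefix("chr1"))
print(toggle_chr_prefix("X"))
```

Applying the conversion twice returns the original name, which is why the method can be described as working "vice versa".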

classmethod from_string(region_string)

Convert a string into a GenomicRegion.

This is a very useful convenience function to quickly define a GenomicRegion object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.

Parameters

region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.

Returns

GenomicRegion
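To make the descriptor format concrete, the following sketch parses such a string into a plain tuple. It is not the library's parser: the names parse_number and parse_region are hypothetical, the accepted suffixes (k, kb, M, Mb, case-insensitive) are an assumption, and omitted parts are returned as None rather than replaced by the defaults the library chooses.

```python
import re

# Multipliers for abbreviated numbers; the accepted suffixes are an assumption.
_SUFFIXES = {"": 1, "k": 1_000, "kb": 1_000, "m": 1_000_000, "mb": 1_000_000}
_NUMBER = re.compile(r"^([0-9]*\.?[0-9]+)([a-zA-Z]*)$")


def parse_number(text):
    """Convert strings such as '12k' or '1.5M' to an integer in base pairs."""
    match = _NUMBER.match(text.strip())
    if match is None:
        raise ValueError(f"Cannot parse number: {text!r}")
    value, suffix = match.groups()
    if suffix.lower() not in _SUFFIXES:
        raise ValueError(f"Unknown suffix: {suffix!r}")
    return int(round(float(value) * _SUFFIXES[suffix.lower()]))


def parse_region(region_string):
    """Split '<chromosome>[:<start>-<end>[:<strand>]]' into a tuple.

    Returns (chromosome, start, end, strand); omitted parts are None.
    """
    fields = region_string.split(":")
    if not 1 <= len(fields) <= 3 or not fields[0]:
        raise ValueError(f"Malformed region string: {region_string!r}")
    chromosome = fields[0]
    start = end = strand = None
    if len(fields) > 1:
        start_text, separator, end_text = fields[1].partition("-")
        if not separator:
            raise ValueError(f"Expected <start>-<end> in: {region_string!r}")
        start, end = parse_number(start_text), parse_number(end_text)
    if len(fields) > 2:
        strand = fields[2]
    return chromosome, start, end, strand


print(parse_region("chr1:1-1000"))
print(parse_region("2:2mb-5mb:-"))
print(parse_region("chrX"))
```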

is_forward()

Return True if this region is on the forward strand of the reference genome.

Returns

True if on ‘+’ strand, False otherwise.

is_reverse()

Return True if this region is on the reverse strand of the reference genome.

Returns

True if on ‘-’ strand, False otherwise.

overlap(region)

Return the overlap in base pairs between this region and another region.

Parameters

region – GenomicRegion to find overlap for

Returns

overlap as int in base pairs

overlaps(region)

Check if this region overlaps with the specified region.

Parameters

region – GenomicRegion object or string
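The relationship between contains, overlap, and overlaps can be illustrated with a small stand-in type. This is a sketch under stated assumptions, not the library's code: Interval is a hypothetical namedtuple, coordinates are assumed to be inclusive on both ends (so a shared end position counts as one base pair), and regions on different chromosomes never overlap.

```python
from collections import namedtuple

# Hypothetical stand-in for a genomic region with inclusive coordinates.
Interval = namedtuple("Interval", ["chromosome", "start", "end"])


def overlap(a, b):
    """Number of base pairs shared by a and b (0 if none)."""
    if a.chromosome != b.chromosome:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overlaps(a, b):
    """True if a and b share at least one base pair."""
    return overlap(a, b) > 0


def contains(outer, inner):
    """True if inner lies completely within outer."""
    return (outer.chromosome == inner.chromosome
            and outer.start <= inner.start
            and inner.end <= outer.end)


a = Interval("chr1", 100, 200)
b = Interval("chr1", 150, 300)
print(overlap(a, b), overlaps(a, b), contains(a, b))
```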

set_attribute(attribute, value)

Safely set an attribute on the GenomicRegion object.

This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.

Parameters
  • attribute – Name of the attribute to be set

  • value – Value of the attribute to be set

property strand_string

Return the ‘strand’ attribute as string.

Returns

strand as str (‘+’, ‘-’, or ‘.’)
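Because strand may be stored as a str, an int, or None, a string form is useful for output. One way to express the mapping, as a sketch only (the helper name strand_to_string is hypothetical, and treating every unrecognised value as ‘.’ is an assumption):

```python
def strand_to_string(strand):
    """Map '+', '-', '.', +1, -1, or None to a one-character string."""
    if strand in ("+", 1):
        return "+"
    if strand in ("-", -1):
        return "-"
    # None, '.', and anything unrecognised are treated as unstranded.
    return "."


for value in (1, -1, None, "+", "."):
    print(repr(value), "->", strand_to_string(value))
```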

property three_prime

Return the position of the 3’ end of this GenomicRegion on the reference.

Returns

int
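The five_prime and three_prime properties depend on the strand: on the reverse strand, the 5’ end is the larger reference coordinate. A minimal sketch of that rule on plain values (the function names are hypothetical, and treating unstranded regions like forward-strand regions is an assumption):

```python
def five_prime(start, end, strand):
    """Reference position of the 5' end."""
    return end if strand in ("-", -1) else start


def three_prime(start, end, strand):
    """Reference position of the 3' end."""
    return start if strand in ("-", -1) else end


print(five_prime(100, 200, "+"), three_prime(100, 200, "+"))
print(five_prime(100, 200, "-"), three_prime(100, 200, "-"))
```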

to_string()

Convert this GenomicRegion to its string representation.

Returns

str

class genomic_regions.regions.Bed(*args, **kwargs)

Data type representing a BED file.

Extends BedTool and therefore provides all the methods of the original class, such as intersect, etc.

merge_overlapping(stat=<function intervals_weighted_mean>, sort=True)

Merge overlapping BED intervals.

Parameters
  • stat – Function to use for scoring the merged interval.

  • sort – Sort bed file intervals by position before merging.

Returns

iterator of merged intervals

class genomic_regions.regions.Bedpe(*args, **kwargs)

Represents a BEDPE file (genomic region pairs).

Access each region of the pair with chromosome<1|2>, start<1|2>, end<1|2>, and strand<1|2>.

property file_type

Return the type of the current file. One of (‘bed’, ‘vcf’, ‘gff’, ‘bam’, ‘sam’, ‘empty’).

>>> a = pybedtools.example_bedtool('a.bed')
>>> print(a.file_type)
bed

class genomic_regions.regions.RegionWrapper(regions)

Provide RegionBased functionality to any list of regions.

This class uses interval trees internally to provide fast region subsetting. On initialisation these trees will be generated, which might take some time.

chromosomes()

Get a list of chromosome names.

class genomic_regions.regions.RegionBased

Base class for working with genomic regions.

Guide for inheriting classes on which functions to override:

MUST (basic functionality):

  • _region_iter

  • _get_regions

SHOULD (works if the above are implemented, but is highly inefficient):

  • _region_subset

  • _region_intervals

CAN (override for potential speed benefits or added functionality):

  • _region_len

  • chromosomes

  • chromosome_lengths

  • region_bins

add_region(region, *args, **kwargs)

Add a genomic region to this object.

This method offers some flexibility in the types of objects that can be loaded. See parameters for details.

Parameters

region – Can be a GenomicRegion, a str in the form ‘<chromosome>:<start>-<end>[:<strand>]’, a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).

static bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into a fixed number of bins.

Parameters
  • intervals – iterator of tuples (start, end, score)

  • bins – Number of bins to divide the region into

  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if the intervals argument does not cover the exact genomic range to be binned.

  • smoothing_window – Size of window (in bins) to smooth scores over

  • nan_replacement – NaN values in the scores will be replaced with this value

  • zero_to_nan – If True, will convert bins with score 0 to NaN

Returns

iterator of tuples: (start, end, score)
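The idea of binning scored intervals can be sketched as follows. This is a simplified illustration rather than the library's algorithm: the function name bin_intervals_sketch is hypothetical, coordinates are treated as continuous half-open ranges, each bin's score is assumed to be the overlap-weighted mean of the intervals it touches, bins without coverage receive NaN, and smoothing is not modelled.

```python
def bin_intervals_sketch(intervals, bins, interval_range=None):
    """Bin (start, end, score) tuples into a fixed number of bins.

    Simplified sketch: half-open coordinates, overlap-weighted mean
    score per bin, NaN for bins that no interval covers.
    """
    intervals = list(intervals)
    if interval_range is None:
        interval_range = (min(i[0] for i in intervals),
                          max(i[1] for i in intervals))
    range_start, range_end = interval_range
    bin_size = (range_end - range_start) / bins

    result = []
    for b in range(bins):
        bin_start = range_start + b * bin_size
        bin_end = bin_start + bin_size
        weighted_sum = 0.0
        covered = 0.0
        for start, end, score in intervals:
            # Length of the part of this interval that falls into the bin
            shared = min(end, bin_end) - max(start, bin_start)
            if shared > 0:
                weighted_sum += shared * score
                covered += shared
        score = weighted_sum / covered if covered > 0 else float("nan")
        result.append((bin_start, bin_end, score))
    return result


scores = [(0, 50, 1.0), (50, 100, 3.0)]
print(bin_intervals_sketch(scores, bins=1))
print(bin_intervals_sketch(scores, bins=4))
```

The interval_range argument matters when the intervals do not span the whole region of interest: the bins are then laid out over the requested range, and uncovered bins stay empty.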

static bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into bins with a fixed size.

Parameters
  • intervals – iterator of tuples (start, end, score)

  • bin_size – Size of each bin in base pairs

  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if the intervals argument does not cover the exact genomic range to be binned.

  • smoothing_window – Size of window (in bins) to smooth scores over

  • nan_replacement – NaN values in the scores will be replaced with this value

  • zero_to_nan – If True, will convert bins with score 0 to NaN

Returns

iterator of tuples: (start, end, score)

binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)

Same as region_intervals, but returns GenomicRegion objects instead of tuples.

Parameters
  • region – String or GenomicRegion object denoting the region to be binned

  • bins – Number of bins to divide the region into

  • bin_size – Size of each bin (alternative to bins argument)

  • smoothing_window – Size of window (in bins) to smooth scores over

  • nan_replacement – NaN values in the scores will be replaced with this value

  • zero_to_nan – If True, will convert bins with score 0 to NaN

  • args – Arguments passed to _region_intervals

  • kwargs – Keyword arguments passed to _region_intervals

Returns

iterator of GenomicRegion objects

property chromosome_lengths

Returns a dictionary of chromosomes and their length in bp.

chromosomes()

Get a list of chromosome names.

find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)

Find the region that is located at the center of each query region.

Parameters

query_regions – Region selector string, GenomicRegion, or list of the former

Returns

index (or list of indexes) of the region at the center of the query region

intervals(*args, **kwargs)

Alias for region_intervals.

region_bins(region)

Takes a genomic region and returns a slice of the bin indices that are covered by the region.

Parameters

region – String or GenomicRegion object for which covered bins will be returned.

Returns

slice

region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)

Return equally-sized genomic intervals and associated scores.

Use either bins or bin_size argument to control binning.

Parameters
  • region – String or GenomicRegion object denoting the region to be binned

  • bins – Number of bins to divide the region into

  • bin_size – Size of each bin (alternative to bins argument)

  • smoothing_window – Size of window (in bins) to smooth scores over

  • nan_replacement – NaN values in the scores will be replaced with this value

  • zero_to_nan – If True, will convert bins with score 0 to NaN

  • args – Arguments passed to _region_intervals

  • kwargs – Keyword arguments passed to _region_intervals

Returns

iterator of tuples: (start, end, score)

region_subset(region, *args, **kwargs)

Takes a GenomicRegion and returns all regions that overlap with the supplied region.

Parameters

region – String or GenomicRegion object for which overlapping regions will be returned.

property regions

Iterate over genomic regions in this object.

Will return a GenomicRegion object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.

Returns

RegionIter

property regions_dict

Return a dictionary with region index as keys and regions as values.

Returns

dict {region.ix: region, …}

to_bed(file_name, subset=None, **kwargs)

Export regions as a BED file.

Parameters
  • file_name – Path of file to write regions to

  • subset – optional GenomicRegion or str to write only regions overlapping this region

  • kwargs – Passed to write_bed()

to_bigwig(file_name, subset=None, **kwargs)

Export regions as a BigWig file.

Parameters
  • file_name – Path of file to write regions to

  • subset – optional GenomicRegion or str to write only regions overlapping this region

  • kwargs – Passed to write_bigwig()

to_gff(file_name, subset=None, **kwargs)

Export regions as a GFF file.

Parameters
  • file_name – Path of file to write regions to

  • subset – optional GenomicRegion or str to write only regions overlapping this region

  • kwargs – Passed to write_gff()

class genomic_regions.regions.GenomicDataFrame(*args, **kwargs)

Represents a DataFrame as a RegionBased object.

For full functionality, it must contain the columns chromosome, start, and end.

chromosomes()

Get a list of chromosome names.

class genomic_regions.regions.Tabix(file_name, preset=None)

Represents a Tabix file.

Tabix-indexed files offer large speed improvements over regular BED/VCF/GFF files.

chromosomes()

Get a list of chromosome names.

class genomic_regions.regions.BigWig(bw)

Represents a BigWig file.

Forwards function and property calls that do not belong to RegionBased to BigWig.

property chromosome_lengths

Returns a dictionary of chromosomes and their length in bp.

chromosomes()

Get a list of chromosome names.

intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)

Alias for region_intervals.

load_intervals_into_memory()

Load entire BigWig file into memory.

May speed up interval search over slow file systems or connections.

region_stats(region, bins=1, stat='mean')

BigWig.stats with region query.

Parameters
  • region – GenomicRegion

  • bins – Number of bins with stats to return

  • stat – name of statistic to use (default: mean)

Returns

interval stats

genomic_regions.regions.as_region(region)

Convert string to GenomicRegion.

This function attempts to convert any string passed to it to a GenomicRegion. Strings are expected to be of the form <chromosome>[:<start>-<end>[:<strand>]], e.g. chr1:1-1000, 2:2mb-5mb:-, chrX:1.5kb-3mb, …

Numbers can be abbreviated as ‘12k’, ‘1.5Mb’, etc.

When fed a GenomicRegion, it will simply be returned, making the use of this function as an “if-necessary” converter possible.

Parameters

region – str or GenomicRegion

Returns

GenomicRegion

genomic_regions.regions.load(file_name, *args, **kwargs)

Open a file containing genomic regions as a RegionBased object.

‘Magic’ function that wraps a file containing genomic regions in a RegionBased interface, thus providing the same methods to different types of genomic data formats.

Compatible formats include: BED, GFF/GTF, BigWig, and Tabix (i.e. BED, GFF, and compatible files indexed with tabix).

SAM files are also detected, but opened using pysam.AlignmentFile, so they are not compatible with the RegionBased interface (yet).

Parameters
  • file_name – Path to genomic regions file

  • args – Additional arguments passed to downstream class

  • kwargs – Additional keyword arguments passed to downstream class

Returns

RegionBased

Raises

ValueError if file type not supported

genomic_regions.regions.merge_overlapping_regions(regions)

Merge overlapping regions in list.

Provided with a list of GenomicRegion objects, this function will determine overlapping regions and merge them. The output is a list of non-overlapping regions.

Parameters

regions – list of GenomicRegion objects

Returns

list of merged GenomicRegion objects
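The merging step can be illustrated with a sort-and-sweep over plain tuples. This is a sketch of the general technique, not the function's actual implementation: merge_overlapping_tuples is a hypothetical name, regions are represented as (chromosome, start, end) tuples with inclusive coordinates, and directly adjacent regions that share no position are assumed to stay separate.

```python
def merge_overlapping_tuples(regions):
    """Merge overlapping (chromosome, start, end) tuples.

    Sketch only: sorts by chromosome and start, then extends the last
    merged region whenever the next one begins before it ends.
    """
    merged = []
    for chromosome, start, end in sorted(regions):
        if merged and merged[-1][0] == chromosome and start <= merged[-1][2]:
            last = merged[-1]
            # Overlap: keep the earlier start and the larger end
            merged[-1] = (chromosome, last[1], max(last[2], end))
        else:
            merged.append((chromosome, start, end))
    return merged


regions = [("chr1", 300, 400), ("chr1", 50, 200),
           ("chr1", 1, 100), ("chr2", 1, 50)]
print(merge_overlapping_tuples(regions))
```

Sorting first means each region only needs to be compared with the most recently merged one, so the sweep itself is a single pass.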