Regions module¶
This module provides functions and classes to work with genomic regions (also referred to as genomic intervals).
Its main classes are:
GenomicRegion
: A class that represents a genomic region/interval
RegionBased
: The base class for collections of genomic regions
The aim of this module, besides providing an intuitive set of tools operating on genomic regions and collections thereof, is to supply a unified interface for the different representations of genomic region sets. Specifically, it gives the user access to the same methods with identical syntax regardless of what type of genomic regions file the user currently works with (BED, GFF, BigWig, Tabix, …).
Most of the time, it is enough to open a file with load(); the module will figure out the underlying file type automatically. Please refer to the documentation for further details.
-
class
genomic_regions.regions.
GenomicRegion
(chromosome=None, start=None, end=None, strand=None, ix=None, **kwargs)¶ Class representing a genomic region.
-
chromosome
¶ Name of the chromosome this region is located on
-
start
¶ Start position of the region in base pairs
-
end
¶ End position of the region in base pairs
-
strand
¶ Strand this region is on. Can be a str (‘+’, ‘-’, ‘.’), None, or an int (+1, -1)
-
ix
¶ Index of the region in the context of a set of genomic regions.
-
as_bed_line
(score_field='score', name_field='name')¶ Return a representation of this object as line in a BED file.
- Parameters
score_field – name of the attribute to be used in the ‘score’ field of the BED line
name_field – name of the attribute to be used in the ‘name’ field of the BED line
- Returns
str
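As a rough illustration of what such a line looks like, the sketch below builds a six-column, tab-separated BED line from plain values. The helper name `bed_line` is hypothetical, and the coordinate handling (1-based inclusive input, 0-based half-open output) is an assumption of this example rather than a statement about the library's implementation.

```python
def bed_line(chromosome, start, end, name=".", score=".", strand="."):
    """Build a tab-separated, six-column BED line (hypothetical helper).

    Assumes `start` and `end` are 1-based and inclusive. BED files use
    0-based, half-open coordinates, so only the start is shifted by one.
    """
    fields = [chromosome, start - 1, end, name, score, strand]
    return "\t".join(str(field) for field in fields)


line = bed_line("chr1", 1001, 2000, name="peak_1", score=0.5, strand="+")
print(line)
```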
-
as_gff_line
(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶ Return a representation of this object as line in a GFF file.
- Parameters
source_field – name of the attribute to be used in the ‘source’ field of the GFF line
feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
score_field – name of the attribute to be used in the ‘score’ field of the GFF line
frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
float_format – Formatting string for the float fields
- Returns
str
-
property
attributes
¶ Return all visible attributes of this
GenomicRegion
.
Returns all attribute names that do not start with an underscore.
- Returns
list of attribute names
-
property
center
¶ Return the center coordinate of the
GenomicRegion
.
- Returns
float
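The center is returned as a float because the midpoint of a region can fall between two base pairs. A minimal sketch of that arithmetic, using a hypothetical standalone function rather than the property itself:

```python
def center(start, end):
    """Midpoint between start and end (hypothetical helper, not the library code)."""
    return start + (end - start) / 2


print(center(1, 1000))   # midpoint falls between two positions
print(center(100, 200))  # midpoint is a whole position, still a float
```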
-
contains
(region)¶ Check if the specified region is completely contained in this region.
- Parameters
region –
GenomicRegion
object or string
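The containment test can be pictured with plain tuples. This is only a sketch: the tuple layout `(chromosome, start, end)` and the function name are assumptions made for illustration.

```python
def contains(outer, inner):
    """Return True if `inner` lies completely within `outer`.

    Both arguments are (chromosome, start, end) tuples; regions on
    different chromosomes never contain each other.
    """
    same_chromosome = outer[0] == inner[0]
    return same_chromosome and outer[1] <= inner[1] and inner[2] <= outer[2]


print(contains(("chr1", 1, 1000), ("chr1", 200, 300)))
print(contains(("chr1", 1, 1000), ("chr1", 900, 1100)))
```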
-
copy
()¶ Return a (shallow) copy of this
GenomicRegion
- Returns
-
expand
(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶ Expand this region by a relative or an absolute amount.
- Parameters
absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>
relative – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*len(self)>
absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
relative_left – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on the left side
relative_right – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on the right side
copy – If True, return a copy of the original region; if False, modify the existing region in place
from_center – If True, measure distance from the center rather than from the start and end of the old region
- Returns
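The arithmetic described for the absolute and relative parameters can be sketched on bare coordinates. The function below is hypothetical; it assumes inclusive coordinates (so the length is `end - start + 1`), truncates the relative amount to whole base pairs, and does not clamp the result at the chromosome boundaries.

```python
def expand(start, end, absolute=0, relative=0.0):
    """Expand (start, end) on both sides (hypothetical sketch).

    `absolute` is a number of base pairs, `relative` a fraction of the
    region length. Applying both at once is a choice of this sketch.
    """
    length = end - start + 1  # assumes inclusive coordinates
    shift = absolute + int(relative * length)
    return start - shift, end + shift


print(expand(1000, 1999, absolute=100))
print(expand(1000, 1999, relative=0.5))
```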
-
property
five_prime
¶ Return the position of the 5’ end of this
GenomicRegion
on the reference.
- Returns
int
-
fix_chromosome
(copy=False)¶ Change chromosome representation from chr<NN> to <NN> or vice versa.
- Parameters
copy – If True, make copy of region, otherwise will modify existing region in place.
- Returns
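The name conversion itself amounts to toggling a `chr` prefix. A minimal sketch on plain strings, with a hypothetical function name:

```python
def toggle_chr_prefix(chromosome):
    """Switch between 'chr<NN>' and '<NN>' style names (hypothetical helper)."""
    if chromosome.startswith("chr"):
        return chromosome[len("chr"):]
    return "chr" + chromosome


print(toggle_chr_prefix("chr1"))
print(toggle_chr_prefix("X"))
```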
-
classmethod
from_string
(region_string)¶ Convert a string into a
GenomicRegion
.
This is a convenience function to quickly define a
GenomicRegion
object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.
- Parameters
region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
- Returns
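To make the descriptor format concrete, here is a standalone parsing sketch. It is not the library's parser: the function names, the accepted unit suffixes (k, M, G with an optional trailing b), and the choice to return None for omitted parts are all assumptions of this example.

```python
import re

# Multipliers for abbreviated positions such as '12k' or '1.5M'.
_UNITS = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
_NUMBER = re.compile(r"^(\d+(?:\.\d+)?)([kmg]?)b?$")


def parse_position(text):
    """Convert '12k', '1.5M', '2mb' or '1000' to a number of base pairs."""
    match = _NUMBER.match(text.strip().lower().replace(",", ""))
    if match is None:
        raise ValueError(f"Cannot parse position: {text!r}")
    value, unit = match.groups()
    return int(round(float(value) * _UNITS[unit]))


def parse_region(region_string):
    """Split '<chromosome>[:<start>-<end>[:<strand>]]' into a tuple.

    Omitted parts are returned as None (an assumption of this sketch).
    """
    parts = region_string.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Cannot parse region: {region_string!r}")
    chromosome = parts[0]
    start = end = strand = None
    if len(parts) > 1:
        start_text, end_text = parts[1].split("-")
        start, end = parse_position(start_text), parse_position(end_text)
    if len(parts) > 2:
        strand = parts[2]
    return chromosome, start, end, strand


print(parse_region("chr1:1-1000"))
print(parse_region("2:2mb-5mb:-"))
```

Splitting on the colon first keeps the strand symbol ‘-’ from being confused with the hyphen that separates start and end.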
-
is_forward
()¶ Return True if this region is on the forward strand of the reference genome.
- Returns
True if on ‘+’ strand, False otherwise.
-
is_reverse
()¶ Return True if this region is on the reverse strand of the reference genome.
- Returns
True if on ‘-’ strand, False otherwise.
-
overlap
(region)¶ Return the overlap in base pairs between this region and another region.
- Parameters
region –
GenomicRegion
to find overlap for
- Returns
overlap as int in base pairs
-
overlaps
(region)¶ Check if this region overlaps with the specified region.
- Parameters
region –
GenomicRegion
object or string
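The relationship between overlap (a size in base pairs) and overlaps (a yes/no check) can be sketched with tuples. The inclusive-coordinate convention, and therefore the `+ 1` in the size calculation, is an assumption of this example; the function names are hypothetical.

```python
def overlap(a, b):
    """Number of shared base pairs between two (chromosome, start, end) tuples.

    Assumes inclusive coordinates; regions on different chromosomes
    share nothing.
    """
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def overlaps(a, b):
    """True if the two regions share at least one base pair."""
    return overlap(a, b) > 0


print(overlap(("chr1", 1, 100), ("chr1", 51, 150)))
print(overlaps(("chr1", 1, 100), ("chr1", 101, 200)))
```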
-
set_attribute
(attribute, value)¶ Safely set an attribute on the
GenomicRegion
object.
This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.
- Parameters
attribute – Name of the attribute to be set
value – Value of the attribute to be set
-
property
strand_string
¶ Return the ‘strand’ attribute as string.
- Returns
strand as str (‘+’, ‘-’, or ‘.’)
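Since the strand may be stored as a str, an int, or None, a string view has to normalise these forms. A minimal sketch of such a mapping, with a hypothetical function name and the assumption that anything other than forward or reverse is reported as ‘.’:

```python
def strand_to_string(strand):
    """Map '+'/+1 to '+', '-'/-1 to '-', and everything else to '.'."""
    if strand in ("+", 1):
        return "+"
    if strand in ("-", -1):
        return "-"
    return "."


for value in ("+", -1, None, "."):
    print(repr(value), "->", strand_to_string(value))
```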
-
property
three_prime
¶ Return the position of the 3’ end of this
GenomicRegion
on the reference.
- Returns
int
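Which coordinate counts as the 5’ or 3’ end depends on the strand. The sketch below shows the usual convention on bare values; treating unstranded regions like forward-strand regions is an assumption of this example, and the function names are hypothetical.

```python
def five_prime(start, end, strand):
    """5' end: the start on the forward strand, the end on the reverse strand."""
    return end if strand in ("-", -1) else start


def three_prime(start, end, strand):
    """3' end: the end on the forward strand, the start on the reverse strand."""
    return start if strand in ("-", -1) else end


print(five_prime(100, 200, "+"), three_prime(100, 200, "+"))
print(five_prime(100, 200, "-"), three_prime(100, 200, "-"))
```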
-
to_string
()¶ Convert this
GenomicRegion
to its string representation.
- Returns
str
-
-
class
genomic_regions.regions.
Bed
(*args, **kwargs)¶ Data type representing a BED file.
Extends
BedTool
and therefore provides all the methods of the original class, such as intersect.
-
merge_overlapping
(stat=<function intervals_weighted_mean>, sort=True)¶ Merge overlapping BED intervals.
- Parameters
stat – Function to use for scoring the merged interval.
sort – Sort bed file intervals by position before merging.
- Returns
iterator of merged intervals
-
-
class
genomic_regions.regions.
Bedpe
(*args, **kwargs)¶ Represents a BEDPE file (genomic region pairs).
Access each region of the pair with chromosome<1|2>, start<1|2>, end<1|2>, and strand<1|2>.
-
property
file_type
¶ Return the type of the current file. One of (‘bed’, ‘vcf’, ‘gff’, ‘bam’, ‘sam’, ‘empty’).
>>> import pybedtools
>>> a = pybedtools.example_bedtool('a.bed')
>>> print(a.file_type)
bed
-
-
class
genomic_regions.regions.
RegionWrapper
(regions)¶ Provide
RegionBased
functionality to any list of regions.
This class uses interval trees internally to provide fast region subsetting. On initialisation these trees will be generated, which might take some time.
-
chromosomes
()¶ Get a list of chromosome names.
-
-
class
genomic_regions.regions.
RegionBased
¶ Base class for working with genomic regions.
Guide for inheriting classes on which functions to override:
- MUST (basic functionality):
_region_iter, _get_regions
- SHOULD (works if above are implemented, but is highly inefficient):
_region_subset, _region_intervals
- CAN (override for potential speed benefits or added functionality):
_region_len, chromosomes, chromosome_lengths, region_bins
-
add_region
(region, *args, **kwargs)¶ Add a genomic region to this object.
This method offers some flexibility in the types of objects that can be loaded. See parameters for details.
- Parameters
region – Can be a
GenomicRegion
, a str in the form ‘<chromosome>:<start>-<end>[:<strand>]’, a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
-
static
bin_intervals
(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into a fixed number of bins.
- Parameters
intervals – iterator of tuples (start, end, score)
bins – Number of bins to divide the region into
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover the exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
-
static
bin_intervals_equidistant
(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into bins with a fixed size.
- Parameters
intervals – iterator of tuples (start, end, score)
bin_size – Size of each bin in base pairs
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover the exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
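To illustrate what binning into fixed-size bins means, the following standalone sketch maps scored intervals onto consecutive bins. It is not the library's algorithm: the function name, the inclusive coordinates, the overlap-weighted mean used as the bin score, and the use of NaN for uncovered bins are all assumptions of this example.

```python
import math


def bin_equidistant(intervals, bin_size, interval_range):
    """Yield (start, end, score) for consecutive bins of `bin_size` base pairs.

    `intervals` holds (start, end, score) tuples with inclusive coordinates.
    Each bin's score is the mean of the interval scores, weighted by how
    many base pairs of the bin each interval covers. Bins that no interval
    covers get NaN.
    """
    intervals = list(intervals)
    range_start, range_end = interval_range
    for bin_start in range(range_start, range_end + 1, bin_size):
        bin_end = min(bin_start + bin_size - 1, range_end)
        weighted_sum = 0.0
        covered = 0
        for start, end, score in intervals:
            shared = min(end, bin_end) - max(start, bin_start) + 1
            if shared > 0:
                weighted_sum += shared * score
                covered += shared
        yield bin_start, bin_end, (weighted_sum / covered if covered else math.nan)


scored = [(1, 50, 2.0), (51, 100, 4.0), (101, 150, 6.0)]
for bin_tuple in bin_equidistant(scored, bin_size=100, interval_range=(1, 300)):
    print(bin_tuple)
```

Passing the range explicitly mirrors the interval_range parameter: the last bin here is not covered by any interval, which only becomes visible because the range extends beyond the intervals.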
-
binned_regions
(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Same as region_intervals, but returns
GenomicRegion
objects instead of tuples.
- Parameters
region – String or GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of
GenomicRegion
objects
-
property
chromosome_lengths
¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes
()¶ Get a list of chromosome names.
-
find_region
(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶ Find the region that is at the center of a query region.
- Parameters
query_regions – Region selector string, GenomicRegion, or list of the former
- Returns
index (or list of indexes) of the region at the center of the query region
-
intervals
(*args, **kwargs)¶ Alias for region_intervals.
-
region_bins
(region)¶ Takes a genomic region and returns a slice of the bin indices that are covered by the region.
- Parameters
region – String or GenomicRegion object for which covered bins will be returned.
- Returns
slice
-
region_intervals
(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶ Return equally-sized genomic intervals and associated scores.
Use either bins or bin_size argument to control binning.
- Parameters
region – String or GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of tuples: (start, end, score)
-
region_subset
(region, *args, **kwargs)¶ Takes a GenomicRegion and returns all regions that overlap with the supplied region.
- Parameters
region – String or GenomicRegion object for which overlapping regions will be returned.
-
property
regions
¶ Iterate over genomic regions in this object.
Will return a
GenomicRegion
object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.
- Returns
RegionIter
-
property
regions_dict
¶ Return a dictionary with region index as keys and regions as values.
- Returns
dict {region.ix: region, …}
-
to_bed
(file_name, subset=None, **kwargs)¶ Export regions as BED file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this region
kwargs – Passed to
write_bed()
-
to_bigwig
(file_name, subset=None, **kwargs)¶ Export regions as BigWig file.
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this region
kwargs – Passed to
write_bigwig()
-
to_gff
(file_name, subset=None, **kwargs)¶ Export regions as GFF file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this region
kwargs – Passed to
write_gff()
-
class
genomic_regions.regions.
GenomicDataFrame
(*args, **kwargs)¶ Represents
DataFrame
as RegionBased object.
For full functionality, it must contain the columns chromosome, start, and end.
-
chromosomes
()¶ Get a list of chromosome names.
-
-
class
genomic_regions.regions.
Tabix
(file_name, preset=None)¶ Represents a Tabix file.
Tabix-indexed files offer large speed improvements over regular BED/VCF/GFF files.
-
chromosomes
()¶ Get a list of chromosome names.
-
-
class
genomic_regions.regions.
BigWig
(bw)¶ Represents a BigWig file.
Forwards function and property calls that do not belong to RegionBased to
BigWig
.
-
property
chromosome_lengths
¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes
()¶ Get a list of chromosome names.
-
intervals
(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Alias for region_intervals.
-
load_intervals_into_memory
()¶ Load entire BigWig file into memory.
May speed up interval search over slow file systems or connections.
-
region_stats
(region, bins=1, stat='mean')¶ BigWig.stats with region query.
- Parameters
region –
GenomicRegion
bins – Number of bins with stats to return
stat – name of statistic to use (default: mean)
- Returns
interval stats
-
-
genomic_regions.regions.
as_region
(region)¶ Convert string to
GenomicRegion
.
This function attempts to convert any string passed to it to a
GenomicRegion
. Strings are expected to be of the form <chromosome>[:<start>-<end>[:<strand>]], e.g. chr1:1-1000, 2:2mb-5mb:-, chrX:1.5kb-3mb, …
Numbers can be abbreviated as ‘12k’, ‘1.5Mb’, etc.
When fed a
GenomicRegion
, it will simply be returned, which makes this function usable as an “if-necessary” converter.
- Parameters
region – str or
GenomicRegion
- Returns
-
genomic_regions.regions.
load
(file_name, *args, **kwargs)¶ Open file containing genomic regions as
RegionBased
object.
‘Magic’ function that wraps a file containing genomic regions in a
RegionBased
interface, thus providing the same methods to different types of genomic data formats.
Compatible formats include: BED, GFF/GTF, BigWig, and Tabix (i.e. BED, GFF, and compatible files indexed with tabix).
SAM files are also detected, but opened using pysam.AlignmentFile, so they are not compatible with the
RegionBased
interface (yet).
- Parameters
file_name – Path to genomic regions file
args – Additional arguments passed to downstream class
kwargs – Additional keyword arguments passed to downstream class
- Returns
- Raises
ValueError if file type not supported
-
genomic_regions.regions.
merge_overlapping_regions
(regions)¶ Merge overlapping regions in list.
Provided with a list of
GenomicRegion
objects, this function will determine overlapping regions and merge them. The output is a list of non-overlapping regions.
- Parameters
regions – list of GenomicRegion objects
- Returns
list of merged GenomicRegion objects
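The merging step can be sketched with a sort followed by a single pass. This standalone version works on `(chromosome, start, end)` tuples rather than GenomicRegion objects and is only an illustration of the idea; the function name is hypothetical and directly adjacent (non-overlapping) regions are deliberately left unmerged.

```python
def merge_overlapping(regions):
    """Merge overlapping (chromosome, start, end) tuples into a sorted list."""
    merged = []
    # Sorting by chromosome and start makes overlapping regions neighbours.
    for chromosome, start, end in sorted(regions):
        if merged and merged[-1][0] == chromosome and start <= merged[-1][2]:
            previous = merged[-1]
            # Extend the previous region if the current one reaches further.
            merged[-1] = (chromosome, previous[1], max(previous[2], end))
        else:
            merged.append((chromosome, start, end))
    return merged


regions = [("chr1", 50, 200), ("chr2", 1, 50), ("chr1", 1, 100), ("chr1", 300, 400)]
print(merge_overlapping(regions))
```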