Custom RegionBased subclasses¶
Are you working with a file format that is not natively supported by
genomic_regions
? This guide will help you subclass RegionBased
yourself, so you can make use of the RegionBased
functionality using
any data format.
To subclass RegionBased
, you need to override a couple of
methods that form the basis of all other RegionBased
methods:
__init__()
: Use this to make data-type specific initialisations_region_iter()
: Provides basic functionality to iterate over regions_get_regions()
: Provides basic functionality to specifically select regions
With the above methods you get all basic RegionBased
functionality,
but for additional speed benefits you should also override:
_region_subset()
: Speeds up region selection by interval_region_len()
: Return the number of regions in the object
In addition, you may override any of the other methods to speed them up, such
as the chromosome list chromosomes
.
In the following, we will use the simple RegionWrapper
as implemented
in this module for illustration:
__init__¶
RegionWrapper
uses a simple list to store GenomicRegion
objects, and interval trees from the intervaltree
module
to allow the region subsetting.
from genomic_regions import RegionBased
from collections import defaultdict
import intervaltree
These are set up in the __init__
method. Each chromosome gets a
separate interval tree, which is stored in a dict
:
class RegionWrapper(RegionBased):
def __init__(self, regions):
super(RegionWrapper, self).__init__()
region_intervals = defaultdict(list) # temporary variable used to hold intervals
self._regions = [] # internal list of regions
for i, region in enumerate(regions):
self._regions.append(region)
# in the "data" argument, we store both the
# region and its original position in the list
interval = intervaltree.Interval(region.start - 1, region.end, data=(i, region))
region_intervals[region.chromosome].append(interval)
self.region_trees = {}
for chromosome, intervals in region_intervals.items():
self.region_trees[chromosome] = intervaltree.IntervalTree(intervals)
_region_iter¶
To iterate over the regions, we simply iterate over the regions list.
_region_iter
should return an iterator:
def _region_iter(self, *args, **kwargs):
for region in self._regions:
yield region
_get_regions¶
To select specific regions, we can also use basic list subsetting:
def _get_regions(self, item, *args, **kwargs):
return self._regions[item]
_region_subset¶
Due to the use of interval trees, region subsetting is also not very complicated:
def _region_subset(self, region, *args, **kwargs):
# select the intervaltree by chromosome
tree = self.region_trees[region.chromosome]
# we sort by the region position int he list here, because that information
# is lost in intervaltree
intervals = sorted(tree[region.start - 1:region.end], key=lambda r: r.data[0])
for interval in intervals:
yield interval.data[1] # iterate over the overlapping regions
other methods¶
Finally, we override two additional methods: _region_len
and chromosomes
.
Both of these would normally be calculated by iterating over all regions to obtain
the necessary information, but we can speed this up greatly by relying on the
internal data structure we chose for RegionWrapper
.
def _region_len(self):
return len(self._regions)
def chromosomes(self):
return list(self.region_trees.keys())
And that is all you need to subclass