Custom RegionBased subclasses

Are you working with a file format that is not natively supported by genomic_regions? This guide will help you subclass RegionBased yourself, so you can make use of the RegionBased functionality using any data format.

To subclass RegionBased, you need to override a couple of methods that form the basis of all other RegionBased methods:

  • __init__(): Use this to make data-type specific initialisations

  • _region_iter(): Provides basic functionality to iterate over regions

  • _get_regions(): Provides basic functionality to specifically select regions

With the above methods you get all basic RegionBased functionality, but for additional speed benefits you should also override:

  • _region_subset(): Speeds up region selection by interval

  • _region_len(): Return the number of regions in the object

In addition, you may override any of the other methods to speed them up, such as the chromosome list chromosomes.

In the following, we will use the simple RegionWrapper as implemented in this module for illustration:

__init__

RegionWrapper uses a simple list to store GenomicRegion objects, and interval trees from the intervaltree module to allow the region subsetting.

from genomic_regions import RegionBased
from collections import defaultdict
import intervaltree

These are set up in the __init__ method. Each chromosome gets a separate interval tree, which is stored in a dict:

class RegionWrapper(RegionBased):
    def __init__(self, regions):
        super(RegionWrapper, self).__init__()

        region_intervals = defaultdict(list)  # temporary variable used to hold intervals

        self._regions = []  # internal list of regions
        for i, region in enumerate(regions):
            self._regions.append(region)

            # in the "data" argument, we store both the
            # region and its original position in the list
            interval = intervaltree.Interval(region.start - 1, region.end, data=(i, region))
            region_intervals[region.chromosome].append(interval)

        self.region_trees = {}
        for chromosome, intervals in region_intervals.items():
            self.region_trees[chromosome] = intervaltree.IntervalTree(intervals)

_region_iter

To iterate over the regions, we simply iterate over the regions list. _region_iter should return an iterator:

def _region_iter(self, *args, **kwargs):
    for region in self._regions:
        yield region

_get_regions

To select specific regions, we can also use basic list subsetting:

def _get_regions(self, item, *args, **kwargs):
    return self._regions[item]

_region_subset

Due to the use of interval trees, region subsetting is also not very complicated:

def _region_subset(self, region, *args, **kwargs):
    # select the intervaltree by chromosome
    tree = self.region_trees[region.chromosome]

    # we sort by the region position int he list here, because that information
    # is lost in intervaltree
    intervals = sorted(tree[region.start - 1:region.end], key=lambda r: r.data[0])
    for interval in intervals:
        yield interval.data[1]  # iterate over the overlapping regions

other methods

Finally, we override two additional methods: _region_len and chromosomes. Both of these would normally be calculated by iterating over all regions to obtain the necessary information, but we can speed this up greatly by relying on the internal data structure we chose for RegionWrapper.

def _region_len(self):
    return len(self._regions)

def chromosomes(self):
    return list(self.region_trees.keys())

And that is all you need to subclass