Working with collections of genomic regions

In the last section, we explored how to work with individual GenomicRegion objects. This section will demonstrate how to work with lists or collections of regions, either loaded from file in any of the supported genomic file formats (BED, GFF, BigWig, Tabix, BEDPE, CSV or tab-delimited tables), or constructed programmatically.

This tutorial assumes you have imported the genomic_regions package like this:

import genomic_regions as gr

Loading or constructing RegionBased objects

To load genomic regions from a supported file format (see above), simply use the load function:

rb = gr.load("/path/to/file.bed")

rb is now a RegionBased object, providing access to a number of useful methods to work with the regions contained in it.

You can also easily construct a RegionBased object from existing GenomicRegion lists using RegionWrapper:

regions = []
for chromosome in ['chr1', 'chr2']:
    for start in range(1, 10000000, 100000):
        r = gr.GenomicRegion(chromosome, start, start + 99999,
                             score=(start - 1)/100000)
        regions.append(r)

rb = gr.RegionWrapper(regions)

Depending on the size of your region list, the last step could take a long time, as it constructs an internal data representation of the region list that facilitates fast searches across intervals. rb is now a RegionBased object.

Working with RegionBased objects

Region subsets

The central attribute/method of RegionBased objects is regions. When used as a property, it iterates over all regions in the object:

for region in rb.regions:
    print(region)

# chr1:1-100000
# chr1:100001-200000
# ...

If supported by the specific RegionBased subclass (works with most file types, otherwise a NotImplementedError will be thrown) you can access ranges of, or specific regions using the square bracket notation:

for region in rb.regions[0:5]:
    print(region)

# chr1:1-100001
# chr1:100001-200001
# chr1:200001-300001
# chr1:300001-400001
# chr1:400001-500001

print(rb.regions[1])  # chr1:100001-200001

However, the real power of regions lies in its double-use as a method. Without arguments, regions() behaves exactly as regions. By providing a region as first argument to regions(), you can extract ranges of regions that overlap with the query:

for region in rb.regions('chr1:1-300k'):
    print(region)

# chr1:1-100001
# chr1:100001-200001
# chr1:200001-300001

You can also change the chromosome representation on the fly:

for region in rb.regions('chr1:1-300k', fix_chromosome=True):
    print(region)

# 1:1-100001
# 1:100001-200001
# 1:200001-300001

If you are interested in all the chromosomes inside the RegionBased object, simply use the chromosomes() method.

Region binning

If your region objects are associated with scores, i.e. each object has a score attribute with a float value, you can make use of the binning functions in RegionBased to get binned scores in a defined interval.

The two main methods for this purpose are binned_regions, which outputs GenomicRegion objects, and region_intervals, which simply returns tuples of the form (start, end, score). Other than the return type, the functions behave in identical fashion, so we are going to focus on binned_regions.

Simply provide binned_regions with a genomic interval in the form of a string or a GenomicRegion, specify the number of bins, and you will obtain equal-sized regions dividing the interval with scores equal to the mean of region scores falling into each bin, weighted by the size of the associated region.

br = rb.binned_regions('chr1', bins=3)

for region in br:
    print(region, region.score)

# chr1:1-3333333 16.169993267997317
# chr1:3333334-6666666 49.336625414171024
# chr1:6666667-9999999 82.49999029411201

Alternatively, you can specify a bin_size:

br = rb.binned_regions('chr1', bin_size=3000000)

for region in br:
    print(region, region.score)

# chr1:1-3000000 14.500000000000004
# chr1:3000001-6000000 44.50000000000001
# chr1:6000001-9000000 74.50000000000001

Note that when choosing a bin_size directly, partial bins at the end of the interval will be omitted.

You can control different aspects of the binning with additional parameters. Most importantly, you can smooth the scores by choosing a smoothing_window size n, which will average scores across n neighboring bins up- and downstream of each bin. I.e. a smoothing_window of 2 will average across 5 bins: two to the left, the original bin, and two to the right. If your data contains NaN values, you can replace them with a fixed value using nan_replacement. On the other hand, you can use zero_to_nan to remove scores of 0 from the calculations.