Working with collections of genomic regions¶
Contents
In the last section, we explored how to work with individual GenomicRegion
objects. This section will demonstrate how to work with lists or collections of regions,
either loaded from file in any of the supported genomic file formats (BED, GFF, BigWig,
Tabix, BEDPE, CSV or tab-delimited tables), or constructed programmatically.
This tutorial assumes you have imported the genomic_regions package like this:
import genomic_regions as gr
Loading or constructing RegionBased objects¶
To load genomic regions from a supported file format (see above), simply use the
load
function:
rb = gr.load("/path/to/file.bed")
rb
is now a RegionBased
object, providing access to a number
of useful methods to work with the regions contained in it.
You can also easily construct a RegionBased
object from existing
GenomicRegion
lists using RegionWrapper
:
regions = []
for chromosome in ['chr1', 'chr2']:
for start in range(1, 10000000, 100000):
r = gr.GenomicRegion(chromosome, start, start + 99999,
score=(start - 1)/100000)
regions.append(r)
rb = gr.RegionWrapper(regions)
Depending on the size of your region list, the last step could take a long time,
as it constructs an internal data representation of the region list that
facilitates fast searches across intervals. rb
is now a
RegionBased
object.
Working with RegionBased objects¶
Region subsets¶
The central attribute/method of RegionBased
objects is regions
.
When used as a property, it iterates over all regions in the object:
for region in rb.regions:
print(region)
# chr1:1-100000
# chr1:100001-200000
# ...
If supported by the specific RegionBased
subclass (works with most file
types, otherwise a NotImplementedError
will be thrown) you can access ranges
of, or specific regions using the square bracket notation:
for region in rb.regions[0:5]:
print(region)
# chr1:1-100001
# chr1:100001-200001
# chr1:200001-300001
# chr1:300001-400001
# chr1:400001-500001
print(rb.regions[1]) # chr1:100001-200001
However, the real power of regions
lies in its double-use as a method.
Without arguments, regions()
behaves exactly as regions
. By
providing a region as first argument to regions()
, you can extract
ranges of regions that overlap with the query:
for region in rb.regions('chr1:1-300k'):
print(region)
# chr1:1-100001
# chr1:100001-200001
# chr1:200001-300001
You can also change the chromosome representation on the fly:
for region in rb.regions('chr1:1-300k', fix_chromosome=True):
print(region)
# 1:1-100001
# 1:100001-200001
# 1:200001-300001
If you are interested in all the chromosomes inside the RegionBased
object, simply use the chromosomes()
method.
Region binning¶
If your region objects are associated with scores, i.e. each object has a
score
attribute with a float value, you can make use of the binning
functions in RegionBased
to get binned scores in a defined interval.
The two main methods for this purpose are binned_regions
, which outputs
GenomicRegion
objects, and region_intervals
, which simply
returns tuples of the form (start, end, score)
. Other than the
return type, the functions behave in identical fashion, so we are going to focus on
binned_regions
.
Simply provide binned_regions
with a genomic interval in the form of a string
or a GenomicRegion
, specify the number of bins
, and you will obtain
equal-sized regions dividing the interval with scores equal to the mean of region
scores falling into each bin, weighted by the size of the associated region.
br = rb.binned_regions('chr1', bins=3)
for region in br:
print(region, region.score)
# chr1:1-3333333 16.169993267997317
# chr1:3333334-6666666 49.336625414171024
# chr1:6666667-9999999 82.49999029411201
Alternatively, you can specify a bin_size
:
br = rb.binned_regions('chr1', bin_size=3000000)
for region in br:
print(region, region.score)
# chr1:1-3000000 14.500000000000004
# chr1:3000001-6000000 44.50000000000001
# chr1:6000001-9000000 74.50000000000001
Note that when choosing a bin_size
directly, partial bins at the end of the
interval will be omitted.
You can control different aspects of the binning with additional parameters. Most
importantly, you can smooth the scores by choosing a smoothing_window
size
n
, which will average scores across n
neighboring bins up- and
downstream of each bin. I.e. a smoothing_window
of 2 will average across
5 bins: two to the left, the original bin, and two to the right.
If your data contains NaN values, you can replace them with a fixed value using
nan_replacement
. On the other hand, you can use zero_to_nan
to
remove scores of 0 from the calculations.