.. _multiple-regions:

###########################################
Working with collections of genomic regions
###########################################

.. contents::
   :depth: 2


In the last section, we explored how to work with individual :class:`~GenomicRegion`
objects. This section will demonstrate how to work with lists or collections of regions,
either loaded from file in any of the supported genomic file formats (BED, GFF, BigWig,
Tabix, BEDPE, CSV or tab-delimited tables), or constructed programmatically.


This tutorial assumes you have imported the genomic_regions package like this:

.. code:: python

    import genomic_regions as gr


*******************************************
Loading or constructing RegionBased objects
*******************************************

To load genomic regions from a supported file format (see above), simply use the
:code:`load` function:

.. code:: python

    rb = gr.load("/path/to/file.bed")

:code:`rb` is now a :class:`~RegionBased` object, providing access to a number
of useful methods to work with the regions contained in it.


You can also easily construct a :class:`~RegionBased` object from existing
:class:`~GenomicRegion` lists using :class:`~RegionWrapper`:

.. code:: python

    regions = []
    for chromosome in ['chr1', 'chr2']:
        for start in range(1, 10000000, 100000):
            r = gr.GenomicRegion(chromosome, start, start + 99999,
                                 score=(start - 1)/100000)
            regions.append(r)

    rb = gr.RegionWrapper(regions)

Depending on the size of your region list, the last step could take a long time,
as it constructs an internal data representation of the region list that
facilitates fast searches across intervals. :code:`rb` is now a
:class:`~RegionBased` object.


********************************
Working with RegionBased objects
********************************

~~~~~~~~~~~~~~
Region subsets
~~~~~~~~~~~~~~

The central attribute/method of :class:`~RegionBased` objects is :code:`regions`.
When used as a property, it iterates over all regions in the object:

.. code:: python

    for region in rb.regions:
        print(region)

    # chr1:1-100000
    # chr1:100001-200000
    # ...

If supported by the specific :class:`~RegionBased` subclass (works with most file
types, otherwise a :code:`NotImplementedError` will be thrown) you can access ranges
of, or specific regions using the square bracket notation:

.. code:: python

    for region in rb.regions[0:5]:
        print(region)

    # chr1:1-100001
    # chr1:100001-200001
    # chr1:200001-300001
    # chr1:300001-400001
    # chr1:400001-500001

    print(rb.regions[1])  # chr1:100001-200001

However, the real power of :code:`regions` lies in its double-use as a method.
Without arguments, :code:`regions()` behaves exactly as :code:`regions`. By
providing a region as first argument to :code:`regions()`, you can extract
ranges of regions that overlap with the query:

.. code:: python

    for region in rb.regions('chr1:1-300k'):
        print(region)

    # chr1:1-100001
    # chr1:100001-200001
    # chr1:200001-300001

You can also change the chromosome representation on the fly:

.. code:: python

    for region in rb.regions('chr1:1-300k', fix_chromosome=True):
        print(region)

    # 1:1-100001
    # 1:100001-200001
    # 1:200001-300001

If you are interested in all the chromosomes inside the :class:`~RegionBased`
object, simply use the :code:`chromosomes()` method.


~~~~~~~~~~~~~~
Region binning
~~~~~~~~~~~~~~

If your region objects are associated with scores, i.e. each object has a
:code:`score` attribute with a float value, you can make use of the binning
functions in :class:`~RegionBased` to get binned scores in a defined interval.

The two main methods for this purpose are :code:`binned_regions`, which outputs
:class:`~GenomicRegion` objects, and :code:`region_intervals`, which simply
returns tuples of the form :code:`(start, end, score)`. Other than the
return type, the functions behave in identical fashion, so we are going to focus on
:code:`binned_regions`.

Simply provide :code:`binned_regions` with a genomic interval in the form of a string
or a :class:`~GenomicRegion`, specify the number of :code:`bins`, and you will obtain
equal-sized regions dividing the interval with scores equal to the mean of region
scores falling into each bin, weighted by the size of the associated region.

.. code:: python

    br = rb.binned_regions('chr1', bins=3)

    for region in br:
        print(region, region.score)

    # chr1:1-3333333 16.169993267997317
    # chr1:3333334-6666666 49.336625414171024
    # chr1:6666667-9999999 82.49999029411201

Alternatively, you can specify a :code:`bin_size`:

.. code:: python

    br = rb.binned_regions('chr1', bin_size=3000000)

    for region in br:
        print(region, region.score)

    # chr1:1-3000000 14.500000000000004
    # chr1:3000001-6000000 44.50000000000001
    # chr1:6000001-9000000 74.50000000000001

Note that when choosing a :code:`bin_size` directly, partial bins at the end of the
interval will be omitted.

You can control different aspects of the binning with additional parameters. Most
importantly, you can smooth the scores by choosing a :code:`smoothing_window` size
:code:`n`, which will average scores across :code:`n` neighboring bins up- and
downstream of each bin. I.e. a :code:`smoothing_window` of 2 will average across
5 bins: two to the left, the original bin, and two to the right.
If your data contains NaN values, you can replace them with a fixed value using
:code:`nan_replacement`. On the other hand, you can use :code:`zero_to_nan` to
remove scores of 0 from the calculations.