Working with individual genomic regions¶
Genomic intervals, or regions, as we call them from here on, are represented
by a GenomicRegion object. This object has attributes
commonly associated with genomic regions, such as “chromosome”, “start” and “end”,
but can in principle have arbitrary attributes, including scores, labels, and other
useful properties describing the region. There is no restriction regarding the types
of attributes: any valid Python object can be associated with a GenomicRegion.
This tutorial assumes you have imported the genomic_regions package like this:
import genomic_regions as gr
Creating genomic regions¶
You can create a genomic region by calling the GenomicRegion constructor:
region = gr.GenomicRegion(chromosome='chr1', start=1, end=1000)
# or to simplify
region = gr.GenomicRegion('chr1', 1, 1000)
start and end must be of type int.
The strand attribute also has special restrictions. It can be a str
("+", "-", "."), an int (-1, +1), or None.
region = gr.GenomicRegion('chr1', 1, 1000, strand='+')
Other attributes have no restrictions, but we recommend that score be a float,
so that the region behaves as expected when you work with it later on.
region = gr.GenomicRegion('chr1', 1, 1000, strand='+',
                          score=1.765, name="my region",
                          my_list=[1, 2, 3, 4])
You can also add attributes to the region later on by using the set_attribute method:
region = gr.GenomicRegion('chr1', 1, 1000, strand='+')
region.set_attribute("my_dict", {'a': 1, 'b': 2})
We advise using set_attribute rather than plain attribute assignment
(the builtin setattr, e.g. region.my_dict = {'a': 1, 'b': 2}),
as GenomicRegion does some processing of the key, value pair for compatibility.
Finally, you can quickly create GenomicRegion objects from strings using the
as_region convenience function:
region = gr.as_region('chr1:1-1000:+')
The region string should have the format <chromosome>:<start>-<end>[:<strand>].
start and end can use common abbreviations for kilo- and megabases,
support decimal and thousand separators, and are case-insensitive,
so writing gr.as_region('chr12:12500000-18000000') is the same as
gr.as_region('chr12:12.5Mb-18Mb'), gr.as_region('chr12:12.5mb-18mb'), and
gr.as_region('chr12:12,500,000-18,000,000').
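To make the string format concrete, here is a minimal, standalone sketch of how such a string could be parsed. It is not the code used by as_region: parse_region_string and its helper are hypothetical names, and the sketch only assumes the format and the abbreviations described above.

```python
import re

# Hypothetical helpers for illustration only; not part of genomic_regions.
_UNITS = {"": 1, "kb": 1_000, "mb": 1_000_000}

_REGION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>[^-:]+)-(?P<end>[^-:]+)(?::(?P<strand>[+\-.]))?$"
)


def _to_base_pairs(text):
    """Convert '12.5Mb', '12,500,000' or '12500000' to an integer position."""
    cleaned = text.strip().replace(",", "")  # drop thousand separators
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(kb|mb)?", cleaned, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Cannot interpret position {text!r}")
    number, unit = match.group(1), (match.group(2) or "").lower()
    return int(round(float(number) * _UNITS[unit]))


def parse_region_string(region_string):
    """Split '<chromosome>:<start>-<end>[:<strand>]' into its four parts."""
    match = _REGION_RE.match(region_string)
    if match is None:
        raise ValueError(f"Not a valid region string: {region_string!r}")
    return (
        match.group("chrom"),
        _to_base_pairs(match.group("start")),
        _to_base_pairs(match.group("end")),
        match.group("strand"),  # None when no strand is given
    )


print(parse_region_string("chr12:12.5Mb-18Mb"))
print(parse_region_string("chr12:12,500,000-18,000,000"))
```

Both calls print the same tuple, which mirrors the equivalence of the as_region calls listed above.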
Genomic region methods¶
Basic¶
The GenomicRegion object comes loaded with useful attributes and methods,
most of which are self-explanatory:
len(region)  # returns the size of the region in base pairs
region.center  # returns the base (or fraction of base) at the center of the region
region.five_prime  # returns the base at the 5' end of the region
region.three_prime  # returns the base at the 3' end of the region
region.is_forward()  # True if strand is '+' or +1
region.is_reverse()  # True if strand is '-' or -1
region.attributes  # returns all attribute names in this region object
region.copy()  # returns a shallow copy of this region
region.to_string()  # returns a region identifier string describing the region
The strand attribute returns an integer (or None, if no strand is set).
To obtain a string, use the method strand_string, which returns one of
"+", "-", or ".".
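The relationship between the accepted strand inputs, the integer representation, and the string representation can be sketched as follows. This is an illustration rather than the library's code: normalize_strand and strand_to_string are hypothetical helpers, and the sketch assumes that "." and None both mean "no strand".

```python
# Hypothetical helpers for illustration only; not part of genomic_regions.
def normalize_strand(strand):
    """Map '+', '-', '.', +1, -1 or None to +1, -1 or None."""
    if strand is None or strand == ".":
        return None  # assumption: '.' is treated like an unset strand
    if strand in ("+", 1):
        return 1
    if strand in ("-", -1):
        return -1
    raise ValueError(f"Unsupported strand value: {strand!r}")


def strand_to_string(strand):
    """Return '+', '-' or '.' for any accepted strand input."""
    return {1: "+", -1: "-", None: "."}[normalize_strand(strand)]


for value in ("+", "-", ".", 1, -1, None):
    print(repr(value), "->", normalize_strand(value), strand_to_string(value))
```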
Modifying the region¶
Some methods are provided that modify the underlying region.
region.expand changes the size of the region on the chromosome, either by an
absolute amount in base pairs (using any of the parameters absolute,
absolute_left, or absolute_right), or by a relative amount, as a fraction of the
current region size (relative, relative_left, or relative_right).
By default, these actions return a modified copy of the original region, but you can
modify the region in place using copy=False.
region = gr.as_region('chr12:12.5Mb-18Mb')
print(region) # chr12:12500000-18000000
new_region = region.expand(absolute='1mb')
print(new_region) # chr12:11500000-19000000
print(region) # chr12:12500000-18000000
region.expand(relative=1.5, copy=False)
print(region) # chr12:4250000-26250000
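The arithmetic behind the values printed above can be sketched in plain Python. This is only an illustration: expand_interval is a hypothetical helper, not part of genomic_regions, and it assumes the region size is measured as end - start, which is what reproduces the numbers in the example. It covers only symmetric expansion and does not guard against a start below zero.

```python
# Hypothetical helper for illustration only; not part of genomic_regions.
def expand_interval(start, end, absolute=0, relative=0.0):
    """Widen (start, end) on both sides and return the new pair.

    absolute: number of base pairs added on each side.
    relative: fraction of the current size added on each side.
    """
    size = end - start  # assumption: size measured as end - start
    shift = int(absolute + relative * size)
    return start - shift, end + shift


print(expand_interval(12_500_000, 18_000_000, absolute=1_000_000))
print(expand_interval(12_500_000, 18_000_000, relative=1.5))
```

With relative=1.5, a 5.5 Mb region grows by 8.25 Mb on each side, which is where the coordinates 4250000 and 26250000 come from.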
You can also easily move a region on the same chromosome by adding or subtracting base pairs.
region = gr.as_region('chr12:12.5Mb-18Mb')
new_region = region + 1000000
print(new_region) # chr12:13500000-19000000
Some databases store chromosome names with the ‘chr’ prefix, others without. You can use
the method fix_chromosome to switch between chromosome formats:
region = gr.as_region('chr12:12.5Mb-18Mb')
new_region = region.fix_chromosome()
print(new_region) # 12:12500000-18000000
Relationship to other regions¶
You can easily check if a region overlaps with another region:
region = gr.as_region('chr12:12.5Mb-18Mb')
region.overlaps('chr12:11Mb-13Mb') # True
region.overlaps('chr12:11Mb-11.5Mb') # False
region.overlaps('chr1:11Mb-13Mb') # False
Similarly, you can get the extent of the overlap in base pairs:
region = gr.as_region('chr12:12.5Mb-18Mb')
region.overlap('chr12:11Mb-13Mb') # 500000
region.overlap('chr12:11Mb-11.5Mb') # 0
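As a rough sketch of the idea, the overlap of two intervals on the same chromosome is the distance between the larger start and the smaller end, or zero if that distance is negative. The helper below is hypothetical and not part of genomic_regions. It assumes the simple end - start difference, which matches the values shown above, and the library may treat boundary bases differently.

```python
# Hypothetical helper for illustration only; not part of genomic_regions.
def overlap_bp(region_a, region_b):
    """Overlap in base pairs of two (chromosome, start, end) tuples."""
    chrom_a, start_a, end_a = region_a
    chrom_b, start_b, end_b = region_b
    if chrom_a != chrom_b:
        return 0  # regions on different chromosomes never overlap
    return max(0, min(end_a, end_b) - max(start_a, start_b))


query = ("chr12", 12_500_000, 18_000_000)
print(overlap_bp(query, ("chr12", 11_000_000, 13_000_000)))
print(overlap_bp(query, ("chr12", 11_000_000, 11_500_000)))
print(overlap_bp(query, ("chr1", 11_000_000, 13_000_000)))
```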
Next steps¶
Next, we will see how to work with lists and collections of regions in Working with collections of genomic regions.