RegionPairsContainer¶
This interface provides common properties and functions to data based on pairs of regions.
A typical example in this regard would be pairs of ligated fragments in a Hi-C library, as
represented within FAN-C by the ReadPairs
class. But also matrix-based
data, such as in Hic
can be interpreted as scores between pairs of
binned genomic regions, hence it also supports the RegionPairsContainer
interface. After loading a dataset using load()
, you can check for
support of the RegionPairsContainer
interface with:
hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic")
isinstance(hic, fanc.matrix.RegionPairsContainer) # True if interface supported
The current list of FAN-C classes supporting RegionPairsContainer
is:
ReadPairs
,
CoolerHic
,
JuicerHic
,
Hic
,
ABCompartmentMatrix
,
DifferenceMatrix
,
FoldChangeMatrix
,
PeakInfo
,
and
RaoPeakInfo
.
The Edge class¶
In RegionPairsContainer
objects, the basic data type is
Edge
. It “connects” two GenomicRegion
objects, called “nodes” within the context of the edge, and supports arbitrary
property annotations, which typically include an edge “weight”. Additionally,
regions are typically associated with an index ix
, which refers to their
position in the region list of the genomic_regions.RegionBased
container.
# create a few regions
region1 = fanc.GenomicRegion(chromosome='chr1', start=1, end=1000, ix=0)
region2 = fanc.GenomicRegion(chromosome='chr1', start=1001, end=2000, ix=1)
region3 = fanc.GenomicRegion(chromosome='chr2', start=1, end=1000, ix=2)
# connect regions with edges
edge1 = fanc.Edge(region1, region2, weight=10)
edge2 = fanc.Edge(region1, region3, weight=1, foo='test')
edge3 = fanc.Edge(region2, region3, weight=2, bar=[1, 2, 3, 4])
The underlying regions can be accessed using the source_node
and
sink_node
properties:
region1 = edge1.source_node
region2 = edge1.sink_node
Accessing regions in this way can be computationally inefficient (depending on the internal structure of the container), which is why the recommended way of accessing regions connected by an edge is through their region index. This is a number describing each region’s location in the list of regions in that container.
edge1.source # 0
edge1.sink # 1
the corresponding regions can then be looked up in the region list:
source_region = hic.regions[edge1.source]
sink_region = hic.regions[edge1.sink]
of, if processing a large number of edges this way, it is more efficient to obtain the list of regions in advance:
regions = list(hic.regions)
for edge in hic.edges:
region1 = regions[edge.source]
region2 = regions[edge.sink]
It is possible to create an edge using only region indexes, without directly linking it to a region object:
edge_ix = fanc.Edge(0, 1, weight=0.2)
A call to source_node
or sink_node
will then raise a ValueError
.
Hic
object edges and many other objects have a “weight”
property, describing, for example, their (normalised) ligation frequency or contact
probability. This properties value is internally multiplied by the value of
edge.bias
for correction on the fly:
edge1.weight # returns "raw" edge weight multiplied by edge.bias
edge1.bias # return the "correction factor" that is applied to weight
If, for example, an Edge
is created like this:
edge = fanc.Edge(0, 3, weight=10)
print(edge) # 0--3; bias: 1.0; weight: 10.0
print(edge.weight) # 10.0
its weight will be as assigned at object creation time. We can modify that value indirectly by changing the bias:
edge.bias = 0.5
print(edge.weight) # 5.0
Other properties are unaffected by the bias value, and can be assigned arbitrarily:
edge.foo = 1.
edge.bar = 'test'
edge.baz = [1, 2, 3, 4]
Note, however, that not all data types support saving those arbitrary properties in the container, and that most of the time you are working with copies of data stored in the container, which remains unmodified even if the edge object is changed. There are exemptions from this, which will be discussed below.
The edges function¶
RegionPairsContainer
compatible objects are built on lists of
regions, which can be accessed and queried using the regions()
function (see RegionBased), and lists of edges. This section shows how these
edge lists can be queried using the edges()
function.
In its simplest form, edges()
can simply be used as
a property and returns an iterator over all edges in the object. We can use this, for example,
to count the edges in the object (not the sum of weights):
edge_counter = 0
for edge in hic.edges:
edge_counter += 1
We can also do this much more simply (and efficiently) by using the built-in len
function:
len(hic.edges)
The real power of edges()
, however, lies in its use as
a function:
edge_counter = 0
for edge in hic.edges():
edge_counter += 1
Note the ()
. This works exactly as the above command without the function call, but now
we can introduce additional arguments. For example, to only iterate over intra-chromosomal
edges, simply do:
edge_counter = 0
for edge in hic.edges(inter_chromosomal=False):
edge_counter += 1
To only return edges connected to chromosome 19, do:
edge_counter = 0
for edge in hic.edges('chr19'):
edge_counter += 1
Importantly, this returns all edges where either source, or sink, or both connected nodes are
on chromosome 19, including, for example, inter-chromosomal edges between chromosome 18 and 19.
To only return edges within chromosome 19 (source and sink), you can combine this with
the inter_chromosomal=False
parameter. However, it is generally more efficient to use
2-dimensional selectors:
# only return intra-chromosomal edges on chromosome 19
edge_counter = 0
for edge in hic.edges(('chr19', 'chr19')):
edge_counter += 1
# only return inter-chromosomal edges between chromosome 18 and 19
edge_counter = 0
for edge in hic.edges(('chr18', 'chr19')):
edge_counter += 1
Selectors support arbitrary region definitions and human-readable abbreviations:
edge_counter = 0
for edge in hic.edges(('chr19:1mb-15mb', 'chr19:30.5mb-45000000')):
edge_counter += 1
Of course, selectors also directly support GenomicRegion
objects:
edge_counter = 0
region = fanc.GenomicRegion(chromosome='chr19', start=6000000, end=18000000)
for edge in hic.edges((region, region)):
edge_counter += 1
When dealing with normalised data (such as balanced Hi-C matrices), the returned edge weights
are already normalised. You can disable the normalisation on the fly using the
norm=False
argument:
valid_pairs = 0
for edge in hic.edges(norm=False):
valid_pairs += edge.weight # here, the edge weight is an unnormalised integer
As shown above, this can be used to count the number of valid pairs in the object, for example.
Lazy evaluation¶
Hi-C datasets are often very large, with hundreds of millions, even billions of valid pairs
in the matrix. The process of creating a unique Edge
object for every
matrix entry can thus cumulatively take a significant amount of time. For this reason, FAN-C
offers lazy evaluation of edge properties. When enabled, edge data is only
read when it is requested. This, for example, avoids reading from file when it is not absolutely
necessary, and saves on time during object creation and population. Edge iterators support
lazy evaluation through the lazy
argument:
weights = []
for edge in hic.edges(lazy=True):
weights.append(edge.weight)
This is significantly faster than the default lazy=False
iteration. However, using
lazy iterators it is easy to encounter confusing situations. Because data is only provided
when explicitly requested from an edge, the following code does not work as intended:
# THIS IS WRONG!
edges = []
for edge in hic.edges(lazy=True):
edges.append(edge)
print(edges[0].source, edges[0].sink, edges[0].weight)
# (159, 248, 0.002386864930511163)
print(edges[1].source, edges[1].sink, edges[1].weight)
# (159, 248, 0.002386864930511163)
print(edges[2].source, edges[2].sink, edges[2].weight)
# (159, 248, 0.002386864930511163)
# ...
All edges in the list are identical! This is because lazy iterations use only a single
instance of the LazyEdge
object, which simply points to the
data location in the edge in the current iteration. This pointer is replaced for the
following edge in the next iteration, but the actual object remains the same. The result
is a list of the same LazyEdge
object pointing to the same edge
data.
Here is the correct way of obtaining data using lazy iterators:
edge_data = []
for edge in hic.edges(lazy=True):
edge_data.append((edge.source, edge.sink, edge.weight))
print(edge_data[0][0], edge_data[0][1], edge_data[0][2])
# 84 85 0.08227859035794333
print(edge_data[1][0], edge_data[1][1], edge_data[1][2])
# 48 56 0.012760965147510347
print(edge_data[2][0], edge_data[2][1], edge_data[2][2])
# 10 153 0.0056418570804748725
# ...
The example accesses the edge data in the loop and stores it independently of the
LazyEdge
object. Lazy iterators can greatly speed up your analyses,
but be very careful working with them!
Another useful feature of lazy iterators is that they support data modification for native FAN-C objects. For example, you double the edge weight of each edge in the object like this:
for edge in hic.edges(lazy=True):
edge.weight *= 2
edge.update()
This only works for files opened in append mode (mode='a'
), and will throw an
error for files opened in read-only mode. The update()
function
ensures data is written to file after modifying it.
Warning
Modifying edges this way can come with unwanted consequences, and is highly discouraged. Potential issues include a bias vector that no longer matches, possible duplicate edges with unknown effect on analysis functions, messed up mappability calculations and more. We always recommend to create an object from scratch with the properties you want instead of modifying an existing one.
Other functions¶
The edges()
iterator is the most versatile and useful
function in the RegionPairsContainer
interface, but by far not the only one.
We will briefly list the most useful other functions with a brief description and examples here.
regions_and_edges¶
The regions_and_edges()
function returns three objects:
a list of regions corresponding to the selected matrix rows; a list of regions corresponding to
the selected matrix columns; and an edge iterator over the matrix subset spanned by the selector.
row_regions, col_regions, edges_iter = hic.regions_and_edges(('chr18', 'chr19'))
for edge in edges_iter:
row_region = row_regions[edge.source]
col_region = row_regions[edge.sink]
# ...
This is simpler and saves computation time over having separate function calls to
regions()
and edges()
.
edge_data¶
The function edge_data()
iterates over only a specific
edge attribute for a selection of edges:
weights = list(hic.edge_data('weight', ('chr19', 'chr19')))
# is identical to
weights = []
for weight in hic.edge_data('weight', ('chr19', 'chr19')):
weights.append(weight)
# is identical to
weights = []
# for edge in hic.edges(('chr19', 'chr19'), lazy=True):
# weights.append(edge.weight)
mappable¶
mappable()
returns a boolean vector where each entry
corresponds to a region in the object. If the vector entry is True
, the regions has
at least one edge connected to it, i.e. a non-zero matrix entry, otherwise the entry is False
.
m = hic.mappable()
# or for a specific region
m_sub = hic.mappable('chr19')
The next interface builds on the RegionPairsContainer()
structure to populate
matrices with edge weights: RegionMatrixContainer.