RegionPairsContainer¶

This interface provides common properties and functions to data based on pairs of regions. A typical example in this regard would be pairs of ligated fragments in a Hi-C library, as represented within FAN-C by the ReadPairs class. But also matrix-based data, such as in Hic can be interpreted as scores between pairs of binned genomic regions, hence it also supports the RegionPairsContainer interface. After loading a dataset using load(), you can check for support of the RegionPairsContainer interface with:

hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic")
isinstance(hic, fanc.matrix.RegionPairsContainer)  # True if interface supported

The current list of FAN-C classes supporting RegionPairsContainer is: ReadPairs, CoolerHic, JuicerHic, Hic, ABCompartmentMatrix, DifferenceMatrix, FoldChangeMatrix, PeakInfo, and RaoPeakInfo.

The Edge class¶

In RegionPairsContainer objects, the basic data type is Edge. It “connects” two GenomicRegion objects, called “nodes” within the context of the edge, and supports arbitrary property annotations, which typically include an edge “weight”. Additionally, regions are typically associated with an index ix, which refers to their position in the region list of the genomic_regions.RegionBased container.

# create a few regions
region1 = fanc.GenomicRegion(chromosome='chr1', start=1, end=1000, ix=0)
region2 = fanc.GenomicRegion(chromosome='chr1', start=1001, end=2000, ix=1)
region3 = fanc.GenomicRegion(chromosome='chr2', start=1, end=1000, ix=2)

# connect regions with edges
edge1 = fanc.Edge(region1, region2, weight=10)
edge2 = fanc.Edge(region1, region3, weight=1, foo='test')
edge3 = fanc.Edge(region2, region3, weight=2, bar=[1, 2, 3, 4])

The underlying regions can be accessed using the source_node and sink_node properties:

region1 = edge1.source_node
region2 = edge1.sink_node

Accessing regions in this way can be computationally inefficient (depending on the internal structure of the container), which is why the recommended way of accessing regions connected by an edge is through their region index. This is a number describing each region’s location in the list of regions in that container.

edge1.source  # 0
edge1.sink  # 1

the corresponding regions can then be looked up in the region list:

source_region = hic.regions[edge1.source]
sink_region = hic.regions[edge1.sink]

of, if processing a large number of edges this way, it is more efficient to obtain the list of regions in advance:

regions = list(hic.regions)
for edge in hic.edges:
    region1 = regions[edge.source]
    region2 = regions[edge.sink]

It is possible to create an edge using only region indexes, without directly linking it to a region object:

edge_ix = fanc.Edge(0, 1, weight=0.2)

A call to source_node or sink_node will then raise a ValueError.

Hic object edges and many other objects have a “weight” property, describing, for example, their (normalised) ligation frequency or contact probability. This properties value is internally multiplied by the value of edge.bias for correction on the fly:

edge1.weight  # returns "raw" edge weight multiplied by edge.bias
edge1.bias  # return the "correction factor" that is applied to weight

If, for example, an Edge is created like this:

edge = fanc.Edge(0, 3, weight=10)
print(edge)  # 0--3; bias: 1.0; weight: 10.0
print(edge.weight)  # 10.0

its weight will be as assigned at object creation time. We can modify that value indirectly by changing the bias:

edge.bias = 0.5
print(edge.weight)  # 5.0

Other properties are unaffected by the bias value, and can be assigned arbitrarily:

edge.foo = 1.
edge.bar = 'test'
edge.baz = [1, 2, 3, 4]

Note, however, that not all data types support saving those arbitrary properties in the container, and that most of the time you are working with copies of data stored in the container, which remains unmodified even if the edge object is changed. There are exemptions from this, which will be discussed below.

The edges function¶

RegionPairsContainer compatible objects are built on lists of regions, which can be accessed and queried using the regions() function (see RegionBased), and lists of edges. This section shows how these edge lists can be queried using the edges() function.

In its simplest form, edges() can simply be used as a property and returns an iterator over all edges in the object. We can use this, for example, to count the edges in the object (not the sum of weights):

edge_counter = 0
for edge in hic.edges:
    edge_counter += 1

We can also do this much more simply (and efficiently) by using the built-in len function:

len(hic.edges)

The real power of edges(), however, lies in its use as a function:

edge_counter = 0
for edge in hic.edges():
    edge_counter += 1

Note the (). This works exactly as the above command without the function call, but now we can introduce additional arguments. For example, to only iterate over intra-chromosomal edges, simply do:

edge_counter = 0
for edge in hic.edges(inter_chromosomal=False):
    edge_counter += 1

To only return edges connected to chromosome 19, do:

edge_counter = 0
for edge in hic.edges('chr19'):
    edge_counter += 1

Importantly, this returns all edges where either source, or sink, or both connected nodes are on chromosome 19, including, for example, inter-chromosomal edges between chromosome 18 and 19. To only return edges within chromosome 19 (source and sink), you can combine this with the inter_chromosomal=False parameter. However, it is generally more efficient to use 2-dimensional selectors:

# only return intra-chromosomal edges on chromosome 19
edge_counter = 0
for edge in hic.edges(('chr19', 'chr19')):
    edge_counter += 1

# only return inter-chromosomal edges between chromosome 18 and 19
edge_counter = 0
for edge in hic.edges(('chr18', 'chr19')):
    edge_counter += 1

Selectors support arbitrary region definitions and human-readable abbreviations:

edge_counter = 0
for edge in hic.edges(('chr19:1mb-15mb', 'chr19:30.5mb-45000000')):
    edge_counter += 1

Of course, selectors also directly support GenomicRegion objects:

edge_counter = 0
region = fanc.GenomicRegion(chromosome='chr19', start=6000000, end=18000000)
for edge in hic.edges((region, region)):
    edge_counter += 1

When dealing with normalised data (such as balanced Hi-C matrices), the returned edge weights are already normalised. You can disable the normalisation on the fly using the norm=False argument:

valid_pairs = 0
for edge in hic.edges(norm=False):
    valid_pairs += edge.weight  # here, the edge weight is an unnormalised integer

As shown above, this can be used to count the number of valid pairs in the object, for example.

Lazy evaluation¶

Hi-C datasets are often very large, with hundreds of millions, even billions of valid pairs in the matrix. The process of creating a unique Edge object for every matrix entry can thus cumulatively take a significant amount of time. For this reason, FAN-C offers lazy evaluation of edge properties. When enabled, edge data is only read when it is requested. This, for example, avoids reading from file when it is not absolutely necessary, and saves on time during object creation and population. Edge iterators support lazy evaluation through the lazy argument:

weights = []
for edge in hic.edges(lazy=True):
    weights.append(edge.weight)

This is significantly faster than the default lazy=False iteration. However, using lazy iterators it is easy to encounter confusing situations. Because data is only provided when explicitly requested from an edge, the following code does not work as intended:

# THIS IS WRONG!
edges = []
for edge in hic.edges(lazy=True):
    edges.append(edge)

print(edges[0].source, edges[0].sink, edges[0].weight)
# (159, 248, 0.002386864930511163)
print(edges[1].source, edges[1].sink, edges[1].weight)
# (159, 248, 0.002386864930511163)
print(edges[2].source, edges[2].sink, edges[2].weight)
# (159, 248, 0.002386864930511163)
# ...

All edges in the list are identical! This is because lazy iterations use only a single instance of the LazyEdge object, which simply points to the data location in the edge in the current iteration. This pointer is replaced for the following edge in the next iteration, but the actual object remains the same. The result is a list of the same LazyEdge object pointing to the same edge data.

Here is the correct way of obtaining data using lazy iterators:

edge_data = []
for edge in hic.edges(lazy=True):
    edge_data.append((edge.source, edge.sink, edge.weight))

print(edge_data[0][0], edge_data[0][1], edge_data[0][2])
# 84 85 0.08227859035794333
print(edge_data[1][0], edge_data[1][1], edge_data[1][2])
# 48 56 0.012760965147510347
print(edge_data[2][0], edge_data[2][1], edge_data[2][2])
# 10 153 0.0056418570804748725
# ...

The example accesses the edge data in the loop and stores it independently of the LazyEdge object. Lazy iterators can greatly speed up your analyses, but be very careful working with them!

Another useful feature of lazy iterators is that they support data modification for native FAN-C objects. For example, you double the edge weight of each edge in the object like this:

for edge in hic.edges(lazy=True):
    edge.weight *= 2
    edge.update()

This only works for files opened in append mode (mode='a'), and will throw an error for files opened in read-only mode. The update() function ensures data is written to file after modifying it.

Warning

Modifying edges this way can come with unwanted consequences, and is highly discouraged. Potential issues include a bias vector that no longer matches, possible duplicate edges with unknown effect on analysis functions, messed up mappability calculations and more. We always recommend to create an object from scratch with the properties you want instead of modifying an existing one.

Other functions¶

The edges() iterator is the most versatile and useful function in the RegionPairsContainer interface, but by far not the only one. We will briefly list the most useful other functions with a brief description and examples here.

regions_and_edges¶

The regions_and_edges() function returns three objects: a list of regions corresponding to the selected matrix rows; a list of regions corresponding to the selected matrix columns; and an edge iterator over the matrix subset spanned by the selector.

row_regions, col_regions, edges_iter = hic.regions_and_edges(('chr18', 'chr19'))
for edge in edges_iter:
    row_region = row_regions[edge.source]
    col_region = row_regions[edge.sink]
    # ...

This is simpler and saves computation time over having separate function calls to regions() and edges().

edge_data¶

The function edge_data() iterates over only a specific edge attribute for a selection of edges:

weights = list(hic.edge_data('weight', ('chr19', 'chr19')))

# is identical to
weights = []
for weight in hic.edge_data('weight', ('chr19', 'chr19')):
    weights.append(weight)

# is identical to
weights = []
# for edge in hic.edges(('chr19', 'chr19'), lazy=True):
#     weights.append(edge.weight)

mappable¶

mappable() returns a boolean vector where each entry corresponds to a region in the object. If the vector entry is True, the regions has at least one edge connected to it, i.e. a non-zero matrix entry, otherwise the entry is False.

m = hic.mappable()
# or for a specific region
m_sub = hic.mappable('chr19')

The next interface builds on the RegionPairsContainer() structure to populate matrices with edge weights: RegionMatrixContainer.