RegionMatrixContainer¶
This interface simplifies and unifies working with matrix data in the context of genomic
region pairs, such as you would find in a Hi-C matrix. It builds on the
RegionPairsContainer (see previous section RegionPairsContainer),
which dealt with scores and other attributes between genomic regions, and extends it by
adding functions for representing the scores in a numeric matrix.
After loading a dataset using load(), you can check for
support of the RegionMatrixContainer interface with:
hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic")
isinstance(hic, fanc.matrix.RegionMatrixContainer) # True if interface supported
The current list of FAN-C classes supporting RegionMatrixContainer is:
CoolerHic,
JuicerHic,
Hic,
ABCompartmentMatrix,
DifferenceMatrix,
FoldChangeMatrix,
PeakInfo,
and
RaoPeakInfo.
The matrix function¶
To obtain the whole-genome, normalised matrix from an object, use the
matrix() function:
m = hic.matrix()
Of course, the matrix() function supports matrix subsets:
m_chr19 = hic.matrix(('chr19', 'chr19'))
When using tuples as keys, the first entry will select the rows, and the second entry the columns of the matrix:
m_inter1 = hic.matrix(('chr18', 'chr19'))
m_inter1.shape # (157, 119)
m_inter2 = hic.matrix(('chr19', 'chr18'))
m_inter2.shape # (119, 157)
The returned object is of type RegionMatrix, which is a subclass
of Numpy’s masked array with added perks for genomic region handling.
A RegionMatrix can be used like any other numpy matrix,
for example calculating marginals by summing up values in rows or columns:
m_chr19.shape # (119, 119)
marginals = np.sum(m_chr19, axis=0)
marginals.shape # (119,)
marginals[:5]
# [1.0000000074007467, 0.9999999585562779,
# 1.0000000102533806, 0.999999987196381, 1.0000000140165086]
(this Hi-C object is normalised on a per-chromosome basis, so each marginal will be close to 1)
Rows and columns in a matrix can be masked, i.e. their entries are considered invalid and
are ignored for most downstream analysis to prevent artifacts. By default, FAN-C masks
regions that have no edges (after filtering), typically due to mappability issues.
You can turn off masking using the mask=False parameter:
m_unmasked = hic.matrix(mask=False)
However, we recommend working with masked matrices to ensure no unwanted edges are part of your analyses.
RegionMatrix objects also keep track of the regions corresponding to
columns and rows of a matrix:
m_inter1.row_regions
# [chr18:1-500000,
# chr18:500001-1000000,
# chr18:1000001-1500000,
# chr18:1500001-2000000,
# ...
m_inter1.col_regions
# [chr19:1-500000,
# chr19:500001-1000000,
# chr19:1000001-1500000,
# chr19:1500001-2000000,
# ...
You can subset a RegionMatrix using indexes or region intervals:
# subset by index
m_chr19_sub1 = m_chr19[0:3, 0:3]
m_chr19_sub1.row_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]
m_chr19_sub1.col_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]
# subset by region interval
m_chr19_sub2 = m_chr19['chr19:2mb-3mb', 'chr19:500kb-1mb']
m_chr19_sub2.row_regions
# [chr19:1500001-2000000, chr19:2000001-2500000, chr19:2500001-3000000]
m_chr19_sub2.col_regions
Note that region interval definitions are always interpreted as 1-based, inclusive, and any
overlapping region is returned (in the above example the region chr19:150001-200000
has a 1 base overlap with the requested interval).
matrix() supports all arguments also available for
edges(), but it is not necessary to use lazy loading.
You can, for example, output an uncorrected matrix with
m_chr19_uncorr = hic.matrix(('chr19', 'chr19'), norm=False)
In addition, there are several parameters specific to
matrix(). Most notably, you can use the
oe=True parameter to return an observed/expected (O/E) matrix:
m_chr19_oe = hic.matrix(('chr19', 'chr19'), oe=True)
Internally, oe=True uses
expected_values to calculate the expected
(average) weight of all edges connecting regions at a certain distance. The matrix
is then simply divided by the expected matrix. You may want to log2-transform the
matrix for a symmetric scale of values:
m_chr19_log_oe = hic.matrix(('chr19', 'chr19'), oe=True, log=True)