RegionMatrixContainer¶
This interface simplifies and unifies working with matrix data in the context of genomic
region pairs, such as you would find in a Hi-C matrix. It builds on the
RegionPairsContainer
(see previous section RegionPairsContainer),
which dealt with scores and other attributes between genomic regions, and extends it by
adding functions for representing the scores in a numeric matrix.
After loading a dataset using load()
, you can check for
support of the RegionMatrixContainer
interface with:
hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic")
isinstance(hic, fanc.matrix.RegionMatrixContainer) # True if interface supported
The current list of FAN-C classes supporting RegionMatrixContainer
is:
CoolerHic
,
JuicerHic
,
Hic
,
ABCompartmentMatrix
,
DifferenceMatrix
,
FoldChangeMatrix
,
PeakInfo
,
and
RaoPeakInfo
.
The matrix function¶
To obtain the whole-genome, normalised matrix from an object, use the
matrix()
function:
m = hic.matrix()
Of course, the matrix()
function supports matrix subsets:
m_chr19 = hic.matrix(('chr19', 'chr19'))
When using tuples as keys, the first entry will select the rows, and the second entry the columns of the matrix:
m_inter1 = hic.matrix(('chr18', 'chr19'))
m_inter1.shape # (157, 119)
m_inter2 = hic.matrix(('chr19', 'chr18'))
m_inter2.shape # (119, 157)
The returned object is of type RegionMatrix
, which is a subclass
of Numpy’s masked array
with added perks for genomic region handling.
A RegionMatrix
can be used like any other numpy matrix,
for example calculating marginals by summing up values in rows or columns:
m_chr19.shape # (119, 119)
marginals = np.sum(m_chr19, axis=0)
marginals.shape # (119,)
marginals[:5]
# [1.0000000074007467, 0.9999999585562779,
# 1.0000000102533806, 0.999999987196381, 1.0000000140165086]
(this Hi-C object is normalised on a per-chromosome basis, so each marginal will be close to 1)
Rows and columns in a matrix can be masked, i.e. their entries are considered invalid and
are ignored for most downstream analysis to prevent artifacts. By default, FAN-C masks
regions that have no edges (after filtering), typically due to mappability issues.
You can turn off masking using the mask=False
parameter:
m_unmasked = hic.matrix(mask=False)
However, we recommend working with masked matrices to ensure no unwanted edges are part of your analyses.
RegionMatrix
objects also keep track of the regions corresponding to
columns and rows of a matrix:
m_inter1.row_regions
# [chr18:1-500000,
# chr18:500001-1000000,
# chr18:1000001-1500000,
# chr18:1500001-2000000,
# ...
m_inter1.col_regions
# [chr19:1-500000,
# chr19:500001-1000000,
# chr19:1000001-1500000,
# chr19:1500001-2000000,
# ...
You can subset a RegionMatrix
using indexes or region intervals:
# subset by index
m_chr19_sub1 = m_chr19[0:3, 0:3]
m_chr19_sub1.row_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]
m_chr19_sub1.col_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]
# subset by region interval
m_chr19_sub2 = m_chr19['chr19:2mb-3mb', 'chr19:500kb-1mb']
m_chr19_sub2.row_regions
# [chr19:1500001-2000000, chr19:2000001-2500000, chr19:2500001-3000000]
m_chr19_sub2.col_regions
Note that region interval definitions are always interpreted as 1-based, inclusive, and any
overlapping region is returned (in the above example the region chr19:150001-200000
has a 1 base overlap with the requested interval).
matrix()
supports all arguments also available for
edges()
, but it is not necessary to use lazy loading.
You can, for example, output an uncorrected matrix with
m_chr19_uncorr = hic.matrix(('chr19', 'chr19'), norm=False)
In addition, there are several parameters specific to
matrix()
. Most notably, you can use the
oe=True
parameter to return an observed/expected (O/E) matrix:
m_chr19_oe = hic.matrix(('chr19', 'chr19'), oe=True)
Internally, oe=True
uses
expected_values
to calculate the expected
(average) weight of all edges connecting regions at a certain distance. The matrix
is then simply divided by the expected matrix. You may want to log2-transform the
matrix for a symmetric scale of values:
m_chr19_log_oe = hic.matrix(('chr19', 'chr19'), oe=True, log=True)