PCAΒΆ
To follow this tutorial, download the FAN-C example data, for example through our Keeper library. Then set up your Python session like this, loading some of our previously published Low-C datasets:
import fanc
import fanc.plotting as fancplot
from fanc.architecture.comparisons import hic_pca
lowc_hindiii_100k = fanc.load("architecture/other-hic/lowc_hindiii_100k_1mb.hic")
lowc_hindiii_5M = fanc.load("architecture/other-hic/lowc_hindiii_5M_1mb.hic")
lowc_mboi_1M = fanc.load("architecture/other-hic/lowc_mboi_1M_1mb.hic")
lowc_mboi_100k = fanc.load("architecture/other-hic/lowc_mboi_100k_1mb.hic")
lowc_mboi_50k = fanc.load("architecture/other-hic/lowc_mboi_50k_1mb.hic")
PCA is one way in FAN-C to compare different Hi-C matrices to each other. A matrix of pixels vs matrices is assembled that contains the (normalised) contact strength of each matrix for the respective pixel (=region pair). PCA is then run on this matrix, and the resulting eigenvectors can be plotted to examine the variability between datasets.
In FAN-C, simply use hic_pca()
for this purpose,
as shown here for chromosome 19:
pca_info, pca_result = hic_pca(lowc_hindiii_5M, lowc_hindiii_100k,
lowc_mboi_1M, lowc_mboi_50k,
lowc_mboi_100k, ignore_zeros=True,
scale=False, region='chr19', sample_size=100000,
strategy='variance')
We can plot the result using pca_plot()
:
fig, ax = fancplot.pca_plot(pca_result, variance=pca_info,
names=["HindIII 5M", "HindIII 100k",
"MboI 1M", "MboI 50k", "MboI 100k"])
We can easily change the colors and markers, for example by colouring all samples with MboI and HindIII differently, and assigning different markers to samples with more or less than 1M cells:
fig, ax = fancplot.pca_plot(pca_result, variance=pca_info,
names=["HindIII 5M", "HindIII 100k",
"MboI 1M", "MboI 50k", "MboI 100k"],
colors=["orange", "orange", "cyan", "cyan", "cyan"],
markers=["*", "o", "*", "o", "o"])
Sometimes the first eigenvector captures the library sequencing depth, so you may want to
plot the second and third EVs instead using eigenvectors=(1,2)
(in this case it does
not seem to be particularly informative):
fig, ax = fancplot.pca_plot(pca_result, variance=pca_info,
eigenvectors=(1, 2),
names=["HindIII 5M", "HindIII 100k",
"MboI 1M", "MboI 50k", "MboI 100k"],
colors=["orange", "orange", "cyan", "cyan", "cyan"],
markers=["*", "o", "*", "o", "o"])
This kind of analysis can be tricky, and selecting informative pixels from the matrix can
be key to getting a robust and intuitive PCA result. In the above example, we are using
several parameters to select informative pixels for the PCA. First, we are only using
pixels that are non-zero in all samples with ignore_zeros=True
. Second, we are sorting
the pixels, listing the ones with the largest variance across samples first, using
strategy='variance'
. Finally, we are selecting the top 100k pixels (those with the
largest variance) first with sample_size=100000
.
When you are analysing matrices of higher resolution, pixels far away from the diagonal
might be dominated by noise. ignore_zeros
removes most of the noisy pixels, but
additionally you might want to set a max_distance
to only select pixels corresponding
to regions closer than this value. Similarly, if you want to exclude contacts close to
the diagonal, use min_distance
.
For more options have a look at the API reference for pca_plot()
.