Hi-C aggregate analysis

Note

The following examples use the matrix files in FAN-C format. If you want to try the same commands using Juicer .hic files, replace output/hic/binned/fanc_example_100kb.hic with architecture/other-hic/fanc_example.juicer.hic@100kb. If you want to work with Cooler files in this tutorial, use architecture/other-hic/fanc_example.mcool@100kb. The results will be minimally different due to the “zooming” and balancing applied by each package.

It can be very informative to view the average Hi-C matrix for a set of regions, rather than the Hi-C matrix at each individual region. This can help you in identifying common structural features of these regions or over-/under-representation of contacts in the vicinity.

Here are examples of TAD and loop aggregate plots from our recent preprint (Kruse et al. (2019)):

../../_images/aggregate_examples_biorxiv.png

You can easily create your own aggregate plots using fanc aggregate.

usage: fanc aggregate [-h] [-m MATRIX_FILE] [-p PLOT_FILE] [--tads]
                      [--tads-imakaev] [--loops]
                      [--loop-strength LOOP_STRENGTH_FILE]
                      [--tad-strength TAD_STRENGTH_FILE] [-w WINDOW]
                      [--pixels PIXELS] [-v REGION_VIEWPOINT]
                      [-b BOUNDARY_MODE] [-i INTERPOLATION] [-r RELATIVE]
                      [-a ABSOLUTE] [-e] [-l] [--rescale]
                      [--colormap COLORMAP] [--vmin VMIN] [--vmax VMAX] [-tmp]
                      [-C] [--keep-submatrices] [-s] [--labels LABELS]
                      [--label-locations LABEL_LOCATIONS]
                      input [regions] [output]

Positional Arguments

input

FAN-C matrix file (e.g. Hic)

regions

File with regions (BED, GFF, Tabix, …) or region pairs (BEDPE)

output

Output AggregateMatrix file for further processing.See -p and -m option for aggregate plot and matrix, respectively.

Named Arguments

-m, --save-matrix

Path to save aggregate matrix (numpy txt format)

-p, --save-plot

Path to save aggregate plot (PDF)

--tads

Use presets for aggregate TADs: –relative 1.0 –expected –log –vmin -1 –vmax 1

--tads-imakaev

Use presets for aggregate TADs: –relative 1.0 –expected–rescale

--loops

Use presets for aggregate loops: –pixels 16 -l

--loop-strength

Calculate loop strengths and save to file. Only works when providing BEDPE file, and Hi-C matrix.

--tad-strength

Calculate tad strengths and save to file. Only works with –tads preset

-w, --window

Width of the region window used for aggregation. If set, will only use the center position from the input regions and extract a submatrix of width -w around this region.

--pixels

Width of the output image in pixels. Default: 90

-v, --region-viewpoint

Viewpoint relative to region when using -w. By default, this measures the window from the region center. You can change this to other locations within each region using this parameter. Possible values:start, end, five_prime, three_prime, center

-b, --boundary-mode

Points outside the boundaries of the input are filled according to the given mode. Options areconstant, edge, symmetrix, reflect, and warp.Default: reflect.

-i, --interpolation

Type of interpolation to use for resizing. 0: Nearest-neighbor (default), 1: Bi-linear, 2: Bi-quadratic, 3: Bi-cubic, 4: Bi-quartic, 5: Bi-quintic

-r, --relative

Relative extension of each region as fraction of region length (l). Final region in the image will be: <start of region - e*l> to <end of region + e*l>. Default: 1.0 (results in 3 times region size image). Additive with “-a” parameter!

-a, --absolute

Extension (e) of each region in base pairs. Final region in the image will be: <start of TAD - e> to <end of TAD + e>. Default: 0 (no extension). Additive with “-r” parameter!

-e, --expected-norm

Normalize matrix to expected values

-l, --log

log2-transform normalized matrices. Only used in conjunction with “-e”.

--rescale

Rescale normalized contact matrices using an a=-0.25 power law. Only used in conjunction with “-e”.

--colormap

Matplotlib colormap to use for matrix

--vmin

Minimum saturation value in image

--vmax

Maximum saturation value in image

-tmp, --work-in-tmp

Work in temporary directory

-C, --no-cache

Do not cache chromosome matrices. Slower, but saves a lot of memory. Use this if you are having trouble with memory usage.

--keep-submatrices

Save all the individual matrices that make up the aggregate matrix in the output object. Useful for debugging and downstream processing. Potentially uses a lot of memory and/or disk space.

-s, --orient-by-strand

Flip submatrix if region is on the negative strand.

--labels

Labels for the left, center, and right edge of the matrix (comma-separated).

--label-locations

Relative location of ticks on bottom and left of aggregate plot (comma-separated). Ranges from 0 (left/bottom) to 1.0 (right/top). Default: 0,0.5,1.0

You can provide fanc aggregate with a list of genomic regions in any of the common region-based formats (BED, GFF, BigWig, …) or with a list of genomic region pairs in BEDPE format. For lists of regions, the aggregate matrix will be located at the Hi-C matrix diagonal. For pairs of regions, matrix subsets can be anywhere in the genome.

Aggregate over variable size regions

By default, if you provide fanc aggregate with a list of regions, it will extract the square Hi-C sub-matrices along the diagonal for each region and interpolate them to match the width set by --pixels (90 by default). It will then calculate the average value for each pixel, which then form the aggregate matrix.

Let’s try this on TADs called using the arrowhead algorithm (Rao and Huntley et al., 2014). fanc aggregate will ignore all regions in the file that are not present in the Hi-C matrix. In our example Hic file, that is everything outside of chromosomes 18 and 19:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg

This command only produces an AggregateMatrix file (fanc_example_100kb.agg), which is useful for further usage with FAN-C, but not easily readable. To extract the aggregate matrix in txt format, simply add -m and to plot it just use -p:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg \
               -p architecture/aggregate/fanc_example_100kb.agg.png \
               -m architecture/aggregate/fanc_example_100kb.agg.txt

The resulting plot looks a bit strange:

../../_images/fanc_example_100kb.agg.png

Important note: if your input regions have variable sizes, as assumed in this section, the resulting aggregate matrix with default settings is highly misleading, since smaller regions will have larger average signal in the interpolated matrix due to being closer to the diagonal. You can easily correct for this effect using O/E matrices instead of the regular Hi-C matrix. Simply set the -e flag for this. -e works very well with log2-transformed data (-l). Let’s see how this changes your matrix:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg \
               -p architecture/aggregate/fanc_example_100kb_oe.agg.png \
               -m architecture/aggregate/fanc_example_100kb_oe.agg.txt \
               -e -l
../../_images/fanc_example_100kb_oe.agg.png

This still does not look like much of a TAD, but we can add a little more context by expanding the plotting region relative to the region size using -r:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg \
               -p architecture/aggregate/fanc_example_100kb_oe_large.agg.png \
               -m architecture/aggregate/fanc_example_100kb_oe_large.agg.txt \
               -e -l -r 1.0
../../_images/fanc_example_100kb_oe_large.agg.png

That plot depicts a region that is 3x the size of the TAD located in its center and already looks like we would expect: High signal in the center, especially at the TAD corner, where the corner loops are typically located.

We can further apply an exponential rescaling (--rescale) of the data to make this look more like a Hi-C matrix:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg \
               -p architecture/aggregate/fanc_example_100kb_oe_large_res.png \
               -m architecture/aggregate/fanc_example_100kb_oe_large.agg.txt \
               -e -r 1.0 --rescale --vmax 0.045
../../_images/fanc_example_100kb_oe_large_res.png

Here, we are not log-transforming the data and we are setting the saturation of the pixel values at 0.045 using --vmax.

Aggregate TADs

For both the log2(O/E) and rescaled versions of the aggregate matrices, there are preset flags you can use called --tads and --tads-imakaev, respectively. The latter is named after the first author of the publication that first used rescaled aggregate matrices in this fashion (Flyamer, Gassler, and Imakaev et al., 2017). In the above example, you can simply run

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/gm12878_tads.bed \
               architecture/aggregate/fanc_example_100kb.agg \
               -p architecture/aggregate/fanc_example_100kb_oe_large.agg.png \
               -m architecture/aggregate/fanc_example_100kb_oe_large.agg.txt \
               --tads

Fixed-width regions

Sometimes, you may want to use a fixed window surrounding a set of features in the aggregate analysis, such as TAD boundaries. fanc aggregate provides the -w option to plot the aggregate Hi-C matrix in a window os size w around the center of each region in the list provided.

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/fanc_example_100kb.insulation_boundaries_score0.7_1mb.bed \
               architecture/aggregate/fanc_example_100kb_boundaries.agg \
               -w 5mb -p architecture/aggregate/fanc_example_100kb_boundaries.agg.png \
               -m architecture/aggregate/fanc_example_100kb_boundaries.agg.txt \
               --vmin 0 --vmax 0.03
../../_images/fanc_example_100kb_boundaries.agg.png

You can see the relatively faint “average boundary” in the centre of the plot. When using O/E and log2-transformed matrices, this becomes much more obvious:

fanc aggregate output/hic/binned/fanc_example_100kb.hic \
               architecture/domains/fanc_example_100kb.insulation_boundaries_score0.7_1mb.bed \
               architecture/aggregate/fanc_example_100kb_boundaries_oe.agg \
               -w 5mb -p architecture/aggregate/fanc_example_100kb_boundaries_oe.agg.png \
               -m architecture/aggregate/fanc_example_100kb_boundaries_oe.agg.txt \
               -e -l
../../_images/fanc_example_100kb_boundaries_oe.agg.png

You can change the viewpoint to other positions within a region, such as the 5’ end, using the -v option.

Loops and other pairwise genomic regions

When you have loop calls or other pairwise genomic regions in BEDPE format, you can use fanc aggregate to make aggregate loop plots. The preset for this is --loops.

fanc aggregate architecture/loops/rao2014.chr11_77400000_78600000.hic \
               architecture/loops/rao2014.chr11_77400000_78600000.loops_no_singlets.bedpe \
               architecture/loops/rao2014.chr11_77400000_78600000.loops_no_singlets.agg \
               -p architecture/aggregate/rao2014.chr11_77400000_78600000.loops_no_singlets.agg.png \
               --loops
../../_images/rao2014.chr11_77400000_78600000.loops_no_singlets.agg.png

Control the size of the plot using the --pixels argument.