Helper tools in fanc
FAN-C provides little helper tools that make working with Hi-C and associated data somewhat easier. These are not strictly necessary for matrix generation and analysis, but can often speed your analysis up or simply make it a little more convenient.
fanc from-txt: import Hic from text file
You can easily import Hi-C matrices from a compatible text file format, such as that from
HiC-Pro, with fanc from-txt.
usage: fanc from-txt [-h] [-tmp] contacts regions output
Positional Arguments
- contacts
Contacts file in sparse matrix format, i.e. each row should contain <bin1><tab><bin2><tab><weight>.
- regions
Path to file with genomic regions, for example in BED format: <chromosome><tab><start><tab><end>. The BED can optionally contain the bin index, as corresponding to the index used in the contacts file.
- output
Output Hic file.
Named Arguments
- -tmp, --work-in-tmp
Work in temporary directory
The command requires two input files:
A sparse matrix with the tab-separated format <bin1><tab><bin2><tab><weight>:
1 1 40.385642
1 828 5.272852
1 1264 5.205258
...
A regions file in BED format <chromosome><tab><start><tab><end>[<tab><bin ID>]:
chr1 0 1000000 1
chr1 1000000 2000000 2
chr1 2000000 3000000 3
chr1 3000000 4000000 4
...
The <bin ID> field is optional, but if provided it must correspond to the bins used in the matrix file. If not provided, bin indices will be 0-based!
The FAN-C example data contains some HiC-Pro example files that you can try this out on:
fanc from-txt hicpro/dixon_2M_1000000_iced.matrix hicpro/dixon_2M_1000000_abs.bed hicpro/dixon_2M_1000000_iced.hic
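To make the two input formats concrete, the following Python sketch writes a tiny sparse matrix and a matching regions file, then checks that every bin used in the matrix has a region. The file names and contact values are invented for illustration; only the column layouts are taken from the description above.

```python
import os
import tempfile

# Hypothetical toy data: three 1 Mb bins on chr1 with 1-based bin IDs.
regions = [("chr1", 0, 1000000, 1),
           ("chr1", 1000000, 2000000, 2),
           ("chr1", 2000000, 3000000, 3)]
# Sparse contacts as (bin1, bin2, weight); the values are invented.
contacts = [(1, 1, 40.5), (1, 2, 5.25), (2, 3, 7.0)]

out_dir = tempfile.mkdtemp()
matrix_path = os.path.join(out_dir, "toy.matrix")  # assumed file name
regions_path = os.path.join(out_dir, "toy.bed")    # assumed file name

with open(matrix_path, "w") as f:
    for bin1, bin2, weight in contacts:
        f.write(f"{bin1}\t{bin2}\t{weight}\n")

with open(regions_path, "w") as f:
    for chrom, start, end, bin_id in regions:
        f.write(f"{chrom}\t{start}\t{end}\t{bin_id}\n")

# Sanity check: every bin referenced in the matrix must exist in the BED file.
with open(regions_path) as f:
    known_bins = {int(line.rstrip("\n").split("\t")[3]) for line in f}
used_bins = set()
with open(matrix_path) as f:
    for line in f:
        b1, b2, _ = line.rstrip("\n").split("\t")
        used_bins.update((int(b1), int(b2)))
missing_bins = used_bins - known_bins
print("bins without a region:", sorted(missing_bins))
```

Files written this way could then be passed to fanc from-txt as the contacts and regions arguments.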
fanc dump: export Hic objects to text file
You can easily export FAN-C Hic objects to a txt file using fanc dump.
usage: fanc dump [-h] [-s SUBSET] [-S] [--only-intra] [-e] [-l] [-u] [-tmp]
hic [matrix] [regions]
Positional Arguments
- hic
Hic file
- matrix
Output file for matrix entries. If not provided, will write to stdout.
- regions
Output file for Hic regions. If not provided, will write regions into matrix file.
Named Arguments
- -s, --subset
Only output this matrix subset. Format: <chr>[:<start>-<end>][--<chr>[:<start>-<end>]], e.g.: “chr1--chr1” to extract only the chromosome 1 submatrix; “chr2:3400000-4200000” to extract contacts of this region on chromosome 2 to all other regions in the genome.
- -S, --no-sparse
Store full, square matrix instead of sparse format.
- --only-intra
Only dump intra-chromosomal data. Dumps everything by default.
- -e, --observed-expected
O/E transform matrix values.
- -l, --log2
Log2-transform matrix values. Useful for O/E matrices (-e option).
- -u, --uncorrected
Output uncorrected (not normalised) matrix values.
- -tmp, --work-in-tmp
Work in temporary directory
If you only pass the Hic object to fanc dump, it will write all Hi-C contacts to
standard output in a tab-delimited format with the columns: chromosome1, start1, end1,
chromosome2, start2, end2, weight (number of contacts). If you add a file path as
second argument, the data will be written to that file. If you instead pass the Hic file
and two output files, the first output file will have the matrix entries in sparse notation,
and the second file will have the Hic regions/bins. You can use -S to export a full
matrix instead of a sparse one, but be warned that these can be extremely large.
If you are only interested in a specific sub-matrix, use the -s or --subset argument
of the form <chromosome>[:<start>-<end>] to export all contacts made by this particular
region across the whole genome. Use <chr>[:<start>-<end>]--<chr>[:<start>-<end>] to export
all contacts made between two regions. E.g. use chr1--chr1 to export the chromosome 1
sub-matrix.
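The seven-column output is straightforward to read back into a script. The sketch below parses it with Python's csv module; the sample text is invented, and the code assumes the output has no header line, which is worth checking against your own files.

```python
import csv
import io

# Hypothetical excerpt of fanc dump output; the values are invented.
dump_text = (
    "chr1\t0\t1000000\tchr1\t0\t1000000\t40.4\n"
    "chr1\t0\t1000000\tchr1\t1000000\t2000000\t5.3\n"
    "chr1\t0\t1000000\tchr2\t0\t1000000\t0.8\n"
)

def read_dump(handle):
    """Yield one dict per contact from the seven-column dump format."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom1, start1, end1, chrom2, start2, end2, weight = row
        yield {
            "region1": (chrom1, int(start1), int(end1)),
            "region2": (chrom2, int(start2), int(end2)),
            "weight": float(weight),
        }

records = list(read_dump(io.StringIO(dump_text)))
# Intra-chromosomal contacts have the same chromosome at both ends.
intra = [r for r in records if r["region1"][0] == r["region2"][0]]
print(len(records), "contacts,", len(intra), "intra-chromosomal")
```

For a real file, replace the io.StringIO object with an open file handle.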
fanc subset: create Hic objects by subsetting
It is sometimes useful to work with smaller Hi-C objects, for example for speed reasons
or to focus the analysis on a particular genomic region of interest. The fanc subset
command makes it possible to create a Hic object that only contains the user-specified
genomic regions, and the contacts between them, from an existing Hic object.
usage: fanc subset [-h] input output regions [regions ...]
Positional Arguments
- input
Input Hic file.
- output
Output Hic file.
- regions
List of regions that will be used in the output Hic object. All contacts between these regions will be in the output object. For example, “chr1 chr3” will result in a Hic object with all regions in chromosomes 1 and 3, plus all contacts within chromosome 1, all contacts within chromosome 3, and all contacts between chromosome 1 and 3. “chr1” will only contain regions and contacts within chromosome 1.
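The selection rule described for the regions argument can be illustrated in a few lines of Python. This is only a sketch of the semantics at whole-chromosome level, not FAN-C's implementation; the contact list and the subset_contacts helper are hypothetical.

```python
# Hypothetical contacts as (chromosome1, chromosome2, weight); values invented.
contacts = [
    ("chr1", "chr1", 12.0),
    ("chr1", "chr2", 3.0),
    ("chr1", "chr3", 4.0),
    ("chr3", "chr3", 9.0),
    ("chr2", "chr2", 8.0),
]

def subset_contacts(contacts, chromosomes):
    """Keep contacts whose two ends both lie in the selected chromosomes."""
    keep = set(chromosomes)
    return [c for c in contacts if c[0] in keep and c[1] in keep]

# "chr1 chr3": contacts within chr1, within chr3, and between chr1 and chr3.
chr1_chr3 = subset_contacts(contacts, ["chr1", "chr3"])
# "chr1": only contacts within chr1.
chr1_only = subset_contacts(contacts, ["chr1"])
print(chr1_chr3)
print(chr1_only)
```

Note that contacts between a selected and an unselected chromosome, such as chr1 to chr2 here, are dropped.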
fanc downsample: downsample Hic objects
Hi-C matrices often have differing numbers of valid pairs, which can be a confounding factor
in many analyses. Differences can stem from varying sequencing depths, different library
qualities, or other experimental and computational factors. fanc downsample is a utility
that downsamples Hic objects to a specific number of valid pairs.
usage: fanc downsample [-h] [-tmp] hic n output
Positional Arguments
- hic
Hic object to be downsampled.
- n
Sample size or reference Hi-C object. If sample size is < 1, it will be interpreted as a fraction of valid pairs.
- output
Downsampled Hic output.
Named Arguments
- -tmp, --work-in-tmp
Work in temporary directory
By default, the sampling is done without replacement. This requires a fairly large amount
of system memory. If you are having trouble with memory usage, use sampling with
replacement (--with-replacement).
Note
Sampling is done on uncorrected matrix values, so you may want to apply matrix
balancing using fanc hic -k afterwards.
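The difference between the two sampling modes, and the fraction rule for n, can be sketched with the standard library. This is a naive illustration on a toy count table, not the algorithm FAN-C uses; the helper names and counts are hypothetical, and the case where n is a reference Hi-C object is not covered.

```python
import random
from collections import Counter

def resolve_sample_size(n, total_pairs):
    """Interpret n < 1 as a fraction of the valid pairs, otherwise as a count."""
    return int(round(n * total_pairs)) if n < 1 else int(n)

def downsample(counts, n, with_replacement=False, seed=0):
    """Draw n valid pairs from a {bin pair: count} dict of uncorrected counts."""
    rng = random.Random(seed)
    pairs = list(counts)
    if with_replacement:
        # Weighted draws; only the per-bin-pair counts are held in memory.
        drawn = rng.choices(pairs, weights=[counts[p] for p in pairs], k=n)
    else:
        # Naive approach: expand to one entry per valid pair, then sample.
        # Memory use grows with the total number of valid pairs.
        expanded = [p for p in pairs for _ in range(counts[p])]
        drawn = rng.sample(expanded, n)
    return Counter(drawn)

# Hypothetical uncorrected counts per bin pair.
counts = {(0, 0): 50, (0, 1): 30, (1, 1): 20}
total = sum(counts.values())
n = resolve_sample_size(0.5, total)
sampled = downsample(counts, n)
sampled_wr = downsample(counts, n, with_replacement=True)
print(n, dict(sampled), dict(sampled_wr))
```

Without replacement, no bin pair can end up with more contacts than it started with; with replacement, that guarantee is lost in exchange for a smaller memory footprint.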
fanc fragments: in silico genome digestion
The fanc pairs and fanc auto commands accept FASTA files as --genome argument,
and fanc conveniently calculates the restriction fragments for you using the
restriction enzyme name specified with --restriction-enzyme. However, the in silico
digestion can be time-consuming, and if you are processing multiple similar Hi-C libraries,
you can use the fanc fragments utility to generate restriction fragments up front,
and use the resulting BED file as input for the --genome argument.
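As a rough picture of what in silico digestion involves, the sketch below cuts a toy sequence at every HindIII site (A^AGCTT) and prints the fragments as BED-style lines. It uses a plain regular expression rather than Biopython, handles a single enzyme only, and the digest helper and toy sequence are hypothetical, so treat it as an illustration of the idea and not as FAN-C's implementation.

```python
import re

def digest(chromosome, sequence, site="AAGCTT", cut_offset=1):
    """Return BED-style (chromosome, start, end) fragments for one enzyme.

    The defaults describe HindIII, which recognises AAGCTT and cuts after
    the first base (A^AGCTT). Coordinates are 0-based and half-open.
    """
    cuts = [m.start() + cut_offset for m in re.finditer(site, sequence.upper())]
    boundaries = [0] + cuts + [len(sequence)]
    return [(chromosome, start, end)
            for start, end in zip(boundaries, boundaries[1:])
            if end > start]  # drop empty fragments at the sequence ends

# Hypothetical toy sequence containing two HindIII sites.
toy = "GGGGAAGCTTCCCCCCAAGCTTTT"
fragments = digest("chrToy", toy)
for chrom, start, end in fragments:
    print(f"{chrom}\t{start}\t{end}")
```

On a full genome this scan has to run over every chromosome, which is why computing the fragments once and reusing the BED file can save time.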
If you supply an integer as the second positional argument instead of a restriction enzyme
name, fanc fragments will perform binning rather than in silico digestion and return
a BED file with equally sized regions.
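Binning is simpler than digestion: each chromosome is divided into consecutive windows of the given size. The sketch below assumes that the final bin is truncated at the chromosome end, which is a common convention but not something stated here; the function name and chromosome length are hypothetical.

```python
def bin_chromosome(chromosome, length, bin_size):
    """Split one chromosome into equally sized bins; the last bin may be shorter."""
    return [(chromosome, start, min(start + bin_size, length))
            for start in range(0, length, bin_size)]

# Hypothetical chromosome length, chosen so that the final bin is truncated.
bins = bin_chromosome("chrToy", 2500000, 1000000)
for chrom, start, end in bins:
    print(f"{chrom}\t{start}\t{end}")
```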
usage: fanc fragments [-h] [-c CHROMOSOMES] input re_or_bin_size output
Positional Arguments
- input
Path to genome file (FASTA, folder with FASTA, hdf5 file), which will be used in conjunction with the type of restriction enzyme to calculate fragments directly.
- re_or_bin_size
Restriction enzyme name or bin size to divide genome into fragments. Restriction enzyme names can be any supported by Biopython, which obtains data from REBASE (http://rebase.neb.com/rebase/rebase.html). Use commas to separate multiple restriction enzymes, e.g. ‘HindIII,MboI’.
- output
Output file with restriction fragments in BED format.
Named Arguments
- -c, --chromosomes
Comma-separated list of chromosomes to include in fragments BED file. Other chromosomes will be excluded. The order of chromosomes will be as stated in the list.
fanc sort-sam: sort SAM files by name
The fanc pairs command expects SAM/BAM files as input that have been sorted by name
(fanc auto automatically sorts files). You can use samtools sort -n to sort files,
but fanc sort-sam will also do the sorting for you. It automatically chooses the fastest
sorting implementation available and also provides the option to work in a temporary folder,
which can speed the sorting up if you are working on a network volume.
usage: fanc sort-sam [-h] [-t THREADS] [-tmp] sam [output]
Positional Arguments
- sam
Input SAM/BAM
- output
Output SAM/BAM. If not provided, will replace input file with sorted version after sorting.
Named Arguments
- -t, --threads
Number of sorting threads (only when sambamba is available). Default: 1
- -tmp, --work-in-tmp
Work in temporary directory
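Why name sorting helps with pairing can be seen in a small Python example: once records are ordered by read name, all alignments of a read sit next to each other and can be collected in a single pass. The records below are invented tuples rather than real SAM lines, and plain string sorting stands in for whatever ordering the actual sorting tool applies.

```python
from itertools import groupby

# Hypothetical (read name, chromosome, position) alignments in arbitrary order.
alignments = [
    ("read2", "chr1", 500),
    ("read1", "chr1", 100),
    ("read2", "chr3", 900),
    ("read1", "chr2", 700),
]

# Sorting by name places all alignments of a read pair next to each other.
by_name = sorted(alignments, key=lambda a: a[0])

# One pass over the sorted records is then enough to collect each pair.
pairs = {name: [(chrom, pos) for _, chrom, pos in group]
         for name, group in groupby(by_name, key=lambda a: a[0])}
print(pairs)
```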