Individual pipeline steps

fanc map: Mapping FASTQ files

To (iteratively) map FASTQ files directly with FAN-C, use the fanc map command.

Here is a minimal example:

fanc map SRR4271982_chr18_19_1.fastq.gzip bwa-index/hg19_chr18_19.fa \
         SRR4271982_chr18_19_1.sam

This command will map the FASTQ file SRR4271982_chr18_19_1.fastq.gzip using the BWA index specified with -i bwa-index/hg19_chr18_19.fa and output the mapped reads to SRR4271982_chr18_19_1.sam.

Note

When downloading FASTQ files from SRA using SRAtools, e.g. with fastq-dump, do not use the -I / --readids option, which appends .1 or .2 to the read name. This interferes with the sorting and read pairing step in FAN-C. Read names of the two mates must be identical.

You can change the suffix of the output file to .bam and fanc map will automatically convert the mapping output to BAM format.

Importantly, fanc map will autodetect if you supply a BWA or Bowtie2 index, so the following command would use Bowtie2 as a mapper:

fanc map SRR4271982_chr18_19_1.fastq.gzip hg19_chr18_19/hg19_chr18_19 \
         SRR4271982_chr18_19_1.sam

You can use additional parameters to control the mapping process:

usage: fanc map [-h] [-m MIN_SIZE] [-s STEP_SIZE] [--trim-front] [-t THREADS]
                [-q QUALITY] [-r RESTRICTION_ENZYME] [-k MAX_ALIGNMENTS] [-a]
                [-b BATCH_SIZE] [--fanc-parallel] [--split-fastq]
                [--memory-map] [--no-iterative] [-tmp]
                input [input ...] index output

Positional Arguments

input

File name of the input FASTQ file (or gzipped FASTQ)

index

Bowtie 2 or BWA genome index base. Index type will be determined automatically.

output

Output file or folder. When providing multiple input files, this must be the path to an output folder.

Named Arguments

-m, --min-size

Minimum length of read before extension. Default 25.

-s, --step-size

Number of base pairs to extend at each round of mapping. Default is 10.

--trim-front

Trim reads from front instead of back.

-t, --threads

Number of threads used for mapping. Default: 1

-q, --quality

Mapping quality cutoff. Alignments with a quality score lower than this will be sent to another mapping iteration. Default: 3 (BWA), 30 (Bowtie2)

-r, --restriction-enzyme

Name (case sensitive) of restriction enzyme used in Hi-C experiment. Will be used to split reads by predicted ligation junction before mapping. You can omit this if you do not want to split your reads by ligation junction. Restriction names can be any supported by Biopython, which obtains data from REBASE (http://rebase.neb.com/rebase/rebase.html). For restriction enzyme cocktails, separate enzyme names with “,”

-k, --max-alignments

Maximum number of alignments per read to be reported.

-a, --all-alignments

Report all valid alignments of a read Warning: very slow!.

-b, --batch-size

Number of reads processed (mapped and merged) in one go per worker. The default 100000 works well for large indexes (e.g. human, mouse). Smaller indexes (e.g. yeast) will finish individual bowtie2 processes very quickly - set this number higher to spawn new processes less frequently.

--fanc-parallel

Use FAN-C parallelisation, which launches multiple mapper jobs. This may be faster in some cases than relying on the internal paralellisation of the mapper, but has potentially high disk I/O and memory usage.

--split-fastq

Split FASTQ file into 10M chunks before mapping. Easier on tmp partitions.

--memory-map

Map Bowtie2 index to memory. Enable if you you system has enough memory to hold the entire Bowtie2 index.

--no-iterative

Do not use iterative mapping strategy. (much faster, less sensitive).

-tmp, --work-in-tmp

Copy original file to temporary directory.Reduces network I/O.

Most importantly, assign more threads to the mapping process using the -t parameter. fanc auto parallelises mapping by spawning multiple mapping processes internally. This can result in high disk I/O - if you have issues with poor performance, try using the --mapper-parallel option, which will instead use the multithreading of your chosen mapping software.

fanc map SRR4271982_chr18_19_1.fastq.gzip bwa-index/hg19_chr18_19.fa \
         SRR4271982_chr18_19_1.sam -t 16

By default, fanc auto performs iterative mapping: Reads are initially trimmed to 25bp (change this with the -m option) before mapping, and then iteratively expanded by 10bp (change the step size with the -s option) until a unique, high quality mapping location can be found. The associated quality cutoff is 3 for BWA and 30 for Bowtie2, but can be changed with the -q parameter.

# expand by 5bp every iteration and be satisfied with lower quality
fanc map SRR4271982_chr18_19_1.fastq.gzip bwa-index/hg19_chr18_19.fa \
         SRR4271982_chr18_19_1.sam -t 16 -s 5 -q 10

The iterative mapping process is slower than simple mapping, but can typically improve mapping efficiency by a few percent, as smaller reads have a lower likelihood of mismatches due to sequencing errors. Disable iterative mapping with the --no-iterative parameter.

# do not perform iterative mapping
fanc map SRR4271982_chr18_19_1.fastq.gzip bwa-index/hg19_chr18_19.fa \
         SRR4271982_chr18_19_1.sam -t 16 --no-iterative

BWA will automatically split chimeric reads and return both mapping locations. This is especially useful for Hi-C data, as reads are often sequenced through a ligation junction, which BWA can often detect automatically. Nonetheless, mapping may be improved by splitting reads at predicted ligation junctions from the start. To enable this, use the -r parameter and supply the name of a restriction enzyme (e.g. HindIII or MboI). The name will be used to look up the enzyme’s restriction pattern, predict the sequence of a ligation junction, and split reads at the predicted junction before mapping starts. Reads split in this manner will have an additional attribute in the SAM/BAM file ZL:i:<n> where <n> is an integer denoting the part of the split read.

# Split reads at HindIII ligation junction before mapping
fanc map SRR4271982_chr18_19_1.fastq.gzip bwa-index/hg19_chr18_19.fa \
         SRR4271982_chr18_19_1.sam -t 16 -r HindIII

If you are using Bowtie2, you can additionally use the --memory-map option, which will load the entire Bowtie2 index into memory to be shared across all Bowtie2 processes. Use this option if your system has a lot of memory available to speed up the mapping. Finally, if you are using the -tmp option, which causes fanc auto to perform most pipeline steps in a temporary directory, you may want to use the --split-fastq option to split the FASTQ files into smaller chunks before mapping, so you can save space on your tmp partition.

fanc pairs: Generating and filtering read Pairs

The fanc pairs command handles the creation and modification of Pairs objects, which represent the mate pairs in a Hi-C library mapped to restriction fragments. Possible inputs are: two SAM/BAM files (paired-end reads, sorted by read name), a HiC-Pro valid pairs file, a 4D Nucleome pairs file, or an existing FAN-C Pairs object. For all but the latter, an output file is also required as last positional argument, for Pairs files the output file is optional. If omitted, filtering is performed in place.

Here are example commands for the different options.

SAM/BAM files:

fanc pairs output/sam/SRR4271982_chr18_19_1_sort.bam \
           output/sam/SRR4271982_chr18_19_2_sort.bam \
           output/pairs/SRR4271982_chr18_19.pairs \
           -g hg19_chr18_19_re_fragments.bed

4D Nucleome pairs file:

fanc pairs 4d_nucleome.pairs output/pairs/4d_nucleome.pairs \
           -g hg19_chr18_19_re_fragments.bed

HiC-Pro valid pairs file:

fanc pairs hic_pro.validPairs output/pairs/hic_pro.pairs \
           -g hg19_chr18_19_re_fragments.bed

Existing FAN-C Pairs object:

fanc pairs output/pairs/SRR4271982_chr18_19.pairs

As you can see, the -g parameter is not necessary when proving an existing Pairs object, as this already has all the fragment information stored in the object. Neither do we need an output file, as further operations will be performed in place. This primarily applies to the filtering of read pairs according to various criteria.

Note

Read names of the two mates must be identical. Some read names, especially those obtained from SRA using SRAtools with the -I / --readids option, have .1 or .2 appended to the read name. This interferes with the sorting and read pairing step in FAN-C, and you need to remove those endings before import.

Additional parameters primarily control the filtering of read pairs:

usage: fanc pairs [-h] [-g GENOME] [-r RESTRICTION_ENZYME] [-m] [-u] [-us]
                  [-q QUALITY] [-c CONTAMINANT] [-i INWARD] [-o OUTWARD]
                  [--filter-ligation-auto] [-d REDIST] [-l] [-p DUP_THRESH]
                  [-s STATS] [--reset-filters] [--statistics-plot STATS_PLOT]
                  [--re-dist-plot RE_DIST_PLOT]
                  [--ligation-error-plot LIGATION_ERROR_PLOT] [-t THREADS]
                  [-b BATCH_SIZE] [-S] [-f] [--bwa] [-tmp]
                  input [input ...]

Positional Arguments

input

IMPORTANT: The last positional argument will be the output file, unless only a single Pairs object is provided. In that case, filtering and correcting will be done in place. Possible inputs are: two SAM/BAM files (paired-end reads, sorted by read name) and an output file; a HiC-Pro pairs file (format: name<tab>chr1<tab>pos1<tab>strand1<tab>chr2<tab>pos2<tab>strand2) and an output file; a pairs file in 4D Nucleome format (https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md) and an output file, or an existing fanc Pairs object. In case of SAM/BAM, HiC-Pro, or 4D Nucleome you must also provide the –genome argument, and if –genome is not a file with restriction fragments (or Hi-C bins), you must also provide the –restriction-enzyme argument.

Named Arguments

-g, --genome

Path to region-based file (BED, GFF, …) containing the non-overlapping regions to be used for Hi-C binning. Typically restriction-enzyme fragments. Alternatively: Path to genome file (FASTA, folder with FASTA, HDF5 file), which will be used in conjunction with the type of restriction enzyme to calculate fragments directly.

-r, --restriction_enzyme

Name of the restriction enzyme used in the experiment, e.g. HindIII, or MboI. Case-sensitive, only necessary when –genome is provided as FASTA. Restriction names can be any supported by Biopython, which obtains data from REBASE (http://rebase.neb.com/rebase/rebase.html). Separate multiple restriction enzymes with “,”

-m, --filter-unmappable

Filter read pairs where one or both halves are unmappable. Only applies to SAM/BAM input!

-u, --filter-multimapping

Filter reads that map multiple times. If the other mapping locations have a lower score than the best one, the best read is kept. Only applies to SAM/BAM input!

-us, --filter-multimapping-strict

Strictly filter reads that map multiple times. Only applies to SAM/BAM input!

-q, --filter-quality

Cutoff for the minimum mapping quality of a read. For numbers larger than 1, will filter on MAPQ. If a number between 0 and 1 is provided, will filter on the AS tag instead of mapping quality (only BWA). The quality cutoff is then interpreted as the fraction of bases that have to be matched for any given read. Only applies to SAM/BAM input! Default: no mapping quality filter.

-c, --filter-contaminant

Filter contaminating reads from other organism. Path to mapped SAM/BAM file. Will filter out reads with the same name. Only applies to SAM/BAM input! Default: no contaminant filter

-i, --filter-inward

Minimum distance for inward-facing read pairs. Default: no inward ligation error filter

-o, --filter-outward

Minimum distance for outward-facing read pairs. Default: no outward ligation error filter

--filter-ligation-auto

Auto-guess settings for inward/outward read pair filters. Overrides –filter-outward and –filter-inward if set. This is highly experimental and known to overshoot in some cases. It is generally recommended to specify cutoffs manually.

-d, --filter-re-distance

Maximum distance for a read to the nearest restriction site. Default: no RE distance filter

-l, --filter-self-ligations

Remove read pairs representing self-ligated fragments.Default: no self-ligation filter.

-p, --filter-pcr-duplicates

If specified, filter read pairs for PCR duplicates. Parameter determines distance between alignment starts below which they are considered starting at same position. Sensible values are between 1 and 5. Default: no PCR duplicates filter

-s, --statistics

Path for saving filter statistics

--reset-filters

Remove all filters from the ReadPairs object.

--statistics-plot

Path for saving filter statistics plot (PDF)

--re-dist-plot

Plot the distribution of restriction site distances of all read pairs (sum left and right read).

--ligation-error-plot

Plot the relative orientation of read pairs mapped to the reference genome as a fraction of reads oriented in the same direction. Allows the identification of ligation errors as a function of genomic distance.

-t, --threads

Number of threads to use for extracting fragment information. Default: 1

-b, --batch-size

Batch size for read pairs to be submitted to individual processes. Default: 1000000

-S, --no-check-sorted

Assume SAM files are sorted and do not check if that is actually the case

-f, --force-overwrite

If the specified output file exists, it will be overwritten without warning.

--bwa

Use filters appropriate for BWA and not Bowtie2. This will typically be identified automatically from the SAM/BAM header. Set this flag if you are having problems during filtering (typically 0 reads pass the filtering threshold).

-tmp, --work-in-tmp

Work in temporary directory

Filtering

fanc pairs provides a lot of parameters for filtering read pairs according to different criteria. By default, if not specified otherwise, no filtering is performed on the read pairs (passthrough). Typically, however, you will at least want to filter out unmappable (-m) and multimapping reads (-u or -us). It is also a good idea to filter by alignment quality (-q <n>). Good cutoffs for Bowtie2 and BWA are 30 and 3, respectively. If you suspect your Hi-C library to be contaminated by DNA from a different organism, you can align your original reads to a different genome and pass the resulting SAM/BAM file to the -c parameter (ensure no unmappable reads are in the file!). This will filter out all reads that have a valid alignment in the putative contaminants genome (by qname). All of the above filters operate on single reads, but will filter out the pair if either of the reads is found to be invalid due to a filtering criterion.

Warning

The -u, -us, -q, and -c filter MUST be applied when loading read pairs from SAM/BAM file, and cannot be added later!

fanc pairs output/sam/SRR4271982_chr18_19_1_sort.bam \
           output/sam/SRR4271982_chr18_19_2_sort.bam \
           output/pairs/SRR4271982_chr18_19.pairs \
           -g hg19_chr18_19_re_fragments.bed \
           -us \
           -q 3

An additional set of filters operates on the properties of the read pair. You may want to filter out self-ligated fragments, which provide no spatial information with the -l parameter. As Hi-C experiments generally rely on PCR amplification, it is expected to find a lot of PCR duplicates in the library. You can filter those with the -p <n> parameter, where <n> denotes the distance between the start of two alignments that would still be considered a duplicate. Normally you would use 1 or 2, but you can use higher values to be more strict with filtering.

Example:

fanc pairs output/pairs/SRR4271982_chr18_19.pairs \
           -l  # filter self-ligated fragments \
           -p 2  # filter PCR duplicates mapping within 2bp

Depending on the experimental setup, it is sometimes expected to find valid Hi-C alignments near restriction sites. You can filter read pairs for their (cumulative) distance to the nearest restriction sites using the -d parameter. To determine that cutoff, or to detect any issues with the Hi-C library, you can first use the --re-dist-plot parameter. Note that this will only plot a sample of 10,000 read pairs for a quick assessment:

fanc pairs --re-dist-plot re-dist.png output/pairs/SRR4271982_chr18_19.pairs
../../_images/re-dist.png

Jin et al. (2013) have identified several errors that stem from incomplete digestion and which can be identified from different types of ligation products. You can filter these using the -i <n> and -o <n> parameters, for the inward and outward ligation errors, respectively. If you need help finding a good cutoff, you may use the --ligation-error-plot parameter.

fanc pairs --ligation-error-plot ligation-err.png output/pairs/SRR4271982_chr18_19.pairs
../../_images/ligation-err.png

Otherwise 1-10kb are often a reasonable cutoffs. You can also let FAN-C attempt to find suitable cutoffs based on the over-representation of certain ligation products using the --filter-ligation-auto parameter, but this is not always 100% reliable.

Finally, you can output the filtering statistics to a file or plot using the -s and --statistics-plot parameters, respectively.

fanc pairs output/pairs/SRR4271982_chr18_19.pairs \
           --statistics-plot pairs.stats.png
../../_images/pairs.stats.png

fanc hic: Generating, binning, and filtering Hic objects

The fanc hic command is used to generate fragment-level and binned Hi-C matrices.

You can use FAN-C Pairs files as input for fanc hic:

fanc hic output/pairs/SRR4271982_chr18_19.pairs output/hic/fragment_level.hic

Without additional parameters, this will generate a fragment-level Hic object and exit. Multiple Pairs files will be converted into fragment-level Hic objects which are then merged into a single object.

If you already have a fragment-level Hic file and you want to bin it, or perform filtering or matrix balancing, you can also use this as input:

fanc hic output/hic/fragment_level.hic output/hic/binned/example_1mb.hic -b 1mb

You have to explicitly provide the binning, filtering and correcting parameters, otherwise the command will exit after it has obtained a single fragment-level Hic object. Here is an overview of all parameters:

usage: fanc hic [-h] [-b BIN_SIZE] [-l FILTER_LOW_COVERAGE]
                [-r FILTER_LOW_COVERAGE_RELATIVE] [-a] [-d FILTER_DIAGONAL]
                [--marginals-plot MARGINALS_PLOT] [--reset-filters]
                [--downsample DOWNSAMPLE] [--subset SUBSET] [-i] [-k] [-n]
                [-m NORM_METHOD] [-w] [-c] [--only-inter] [-s STATS]
                [--statistics-plot STATS_PLOT]
                [--chromosomes CHROMOSOMES [CHROMOSOMES ...]] [-f]
                [-t THREADS] [--deepcopy] [-tmp]
                input [input ...]

Positional Arguments

input

IMPORTANT: The last positional argument will be the output file, unless only a single Hic object is provided. In that case, binning, filtering and correcting will be done in place. Input files. If these are FAN-C Pairs objects (see “fanc pairs”), they will be turned into Hic objects. Hic objects (also the ones converted from Pairs) will first be merged and the merged object will be binned, filtered and corrected as specified in the remaining parameters.

Named Arguments

-b, --bin-size

Bin size in base pairs. You can use human-readable formats,such as 10k, or 1mb. If omitted, the command will end after the merging step.

-l, --filter-low-coverage

Filter bins with low coverage (lower than specified absolute number of contacts)

-r, --filter-low-coverage-relative

Filter bins using a relative low coverage threshold (lower than the specified fraction of the median contact count)

-a, --low-coverage-auto

Filter bins with “low coverage” (under 10% of median coverage for all non-zero bins)

-d, --diagonal

Filter bins along the diagonal up to this specified distance. Use 0 for only filtering the diagonal.

--marginals-plot

Plot Hi-C marginals to determine low coverage thresholds.

--reset-filters

Remove all filters from the Hic object.

--downsample

Downsample a binned Hi-C object before filtering and correcting. Sample size or reference Hi-C object. If sample size is < 1,will be interpreted as a fraction of valid pairs.

--subset

Comma-separated list of regions that will be used in the output Hic object. All contacts between these regions will be in the output object. For example, “chr1,chr3” will result in a Hic object with all regions in chromosomes 1 and 3, plus all contacts within chromosome 1, all contacts within chromosome 3, and all contacts between chromosome 1 and 3. “chr1” will only contain regions and contactswithin chromosome 1.

-i, --ice-correct

DEPRECATED. Use ICE iterative correction on the binned Hic matrix

-k, --kr-correct

DEPRECATED. Use Knight-Ruiz matrix balancing to correct the binned Hic matrix

-n, --normalise

Normalise Hi-C matrix according to –norm-method

-m, --norm-method

Normalisation method used for -n. Options are: KR (default) = Knight-Ruiz matrix balancing (Fast, accurate, but memory-intensive normalisation); ICE = ICE matrix balancing (less accurate, but more memory-efficient); VC = vanilla coverage (a single round of ICE balancing);VC-SQRT = vanilla coverage square root (reduces overcorrection compared to VC)

-w, --whole-matrix

Correct the whole matrix at once, rather than individual chromosomes.

-c, --restore-coverage

Restore coverage to the original total number of reads. Otherwise matrix entries will be contact probabilities.

--only-inter

Calculate bias vector only on inter-chromosomal contacts. Ignores all intra-chromosomal contacts. Always uses whole-matrix balancing, i.e. implicitly sets -w

-s, --statistics

Path for saving filter statistics

--statistics-plot

Path for saving filter statistics plot (PDF)

--chromosomes

Limit output Hic object to these chromosomes. Only available in conjunction with “-b” option.

-f, --force-overwrite

If the specified output file exists, it will be overwritten without warning.

-t, --threads

Number of threads (currently used for binning only)

--deepcopy

Deep copy Hi-C file. Can be used

-tmp, --work-in-tmp

Work in temporary directory

Binning

You can use the -b parameter to bin the fragment-level Hi-C matrix. You can either use integers (1000000) or common abbreviations (1Mb). The filtering steps outlined below only apply to binned Hic matrices.

Filtering and balancing

fanc hic provides a few filtering options. Most likely you want to apply a coverage filter using -l to specify a coverage threshold in absolute number of pairs per bin, or -r to apply a coverage threshold based on a fraction of the median number of pairs pair bin. -a is simply a preset for -r 0.1.

For some applications it might be useful to remove the prominent Hi-C diagonal. You can use the -d <n> parameter to remove all pairs in pixels up to a distance of n from the diagonal.

You can balance your Hi-C matrices using the -n parameter. By default, this applies Knight-Ruiz (KR) balancing to each chromosome. You can opt for another normalisation method using --norm-method, for example --norm-method ice for ICE balancing, or --norm-method vc for vanilla coverage normalisation (which in this case is equivalent of a single ICE iteration). We typically recommend KR balancing for performance reasons. Each chromosome in the matrix is corrected independently, unless you specify the -w option.

Note

The default normalisation mode is per chromosome, and the FAN-C analysis functions assume this has been used. In the majority of cases, when you are working with intra-chromosomal data, this is what you want. Some commands have a --whole-genome option (or equivalent) to apply the analysis to the whole matrix instead of individual chromosomes. Only use this option if you have corrected the whole matrix at once (-w), and if you understand the effect of whole-matrix analysis on your results, compared to per-chromosome analysis!

When you are working with inter-chromosomal data, the chromosome’s bias vectors are re-used for normalisation. This is likely not the most suitable normalisation for your inter-chromosomal data! If you are interested in inter-chromosomal data specifically, you may want to use whole-genome correction, and possibly also exclude the intra-chromosomal contacts. You can do that using the --only-inter option.

By default the corrected matrix entries correspond to contact probabilities. You can use the --restore-coverage option to force matrix entries in a chromosome to sum up to the total number of reads before correction.

Finally, you can output the filtering statistics to a file or plot using the -s and --statistics-plot parameters, respectively.