.. _fanc-auto: ################################## Generating Hi-C matrices with fanc ################################## This part of the documentation will focus primarily on ``fanc auto`` - the most versatile command in the FAN-C toolkit. Its main goal is to convert any input to binned Hi-C matrices. The following schematic will give you an overview of what file types ``fanc auto`` can handle and how they are processed downstream. .. image:: images/fanc-auto-schematic.png ``fanc auto`` will map reads in FASTQ (or gzipped FASTQ) files to a reference genome, generating SAM/BAM files. SAM/BAM files with paired-end reads will be automatically sorted and mate pairs will be matched to generate Pairs files. Pairs files will be converted into fragment-level Hic objects. Multiple fragment-level Hic objects will be merged into a single Hi-C object. Finally, the fragment-level Hic object will be binned at various bin sizes. Internally, ``fanc auto`` constructs its Hi-C processing pipeline from more specialised ``fanc`` commands. When describing the different pipeline steps and how you can control them below, we will also reference the specialised command that is used to build each step of the pipeline. ********** Parameters ********** Each pipeline step in ``fanc auto`` is controlled by a specific set of parameters in ``fanc auto``. Some of these are mandatory for specific types of input, others are optional and affect, for example, the processing and filtering of objects in the pipeline. Here is the help output for ``fanc auto``: .. argparse:: :module: fanc.commands.auto :func: auto_parser :prog: fanc auto :nodescription: :nodefault: =================== Mandatory arguments =================== .. code:: python fanc auto < ... > ``fanc auto`` accepts any number of input files, which will be discussed below. The last positional argument (without '-') **must always be the output folder** for all intermediate and final FAN-C files. ``fanc auto`` will generate the following folder structure in the output folder: .. code:: bash output_folder ├── fastq ├── sam ├── pairs ├── hic │ └── binned └── plots └── stats You will find the processed files in these folders after completion of the pipeline - the folder names should be self-explanatory. ================= General arguments ================= ``fanc auto`` will name its output files using the ``-n`` parameter as prefix. If this is not provided, it will try to come up with a basename from the largest overlap of the input file names. As this can be surprising it is often best to specify the ``-n`` parameter directly. You should also use the ``-t`` parameter to set the maximum number of parallel threads used by ``fanc auto`` - by default, it uses only a single processor. It is highly recommended to set this to a much higher value to enable parallel processing of large datasets. If you are working on a network, or with multiple hard drives, you may want to use the ``-tmp`` option. It instructs ``fanc auto`` to perform as many calculations as possible in a temporary directory. *********** Input types *********** For this tutorial, we are going to use the example data provided on our `GitHub page `_ in the ``examples`` folder. It is a downsampled Hi-C library of a previously published human adrenal tissue dataset (`SRR4271982 of GSM2322539 `_) that only contains chromosomes 18 and 19. Now let us discuss the different input types ``fanc auto`` can handle. =========== FASTQ input =========== To process FASTQ files with ``fanc``, you must first have `Bowtie2 `_ or `BWA `_ installed on your system and available in your PATH. Additionally, you need the corresponding index for the reference genome of your choice. We currently recommend using BWA, as it supports chimeric reads, which are frequent in Hi-C libraries when ligation junctions are sequenced. Once you have these prerequisites, you can call ``fanc auto`` like this, assuming you are in the \ ``examples`` folder: .. code:: bash fanc auto SRR4271982_chr18_19_1.fastq.gzip SRR4271982_chr18_19_2.fastq.gzip \ ./example_output/ -i bwa-index/hg19_chr18_19.fa \ -g hg19_chr18_19_re_fragments.bed The first two arguments are the paired-end FASTQ files. ``fanc auto`` works with FASTQ and gzipped FASTQ files. In general, ``fanc auto`` assumes that two consecutive FASTQ file arguments are mate pairs (there is no pattern matching on _1 and _2 involved, so make sure you have the correct order of input files!). .. note:: When downloading FASTQ files from SRA using SRAtools, e.g. with `fastq-dump`, do not use the ``-I / --readids`` option, which appends ``.1`` or ``.2`` to the read name. This interferes with the sorting and read pairing step in FAN-C. **Read names of the two mates must be identical**. Following the FASTQ files as the last positional argument is the output folder (``example_output``). ``-i`` or ``--genome-index`` instructs ``fanc auto`` to use the specified index for mapping the FASTQ files to a reference genome. It will automatically determine whether a BWA mem or Bowtie2 index is provided and choose the mapping software accordingly. Other mappers are currently not supported (raise an `issue on GitHub `_ if you are interested in support for your favourite mapper). .. warning:: Always use the most comprehensive assembly of your genome of interest. Do not generate and index from a subset of chromosomes. If you want to limit the chromosomes in your Hi-C analysis, for example to canonical chromosomes, please do that using the ``-g`` argument! The last parameter (``-g``) is necessary for generating a fragment-level Hi-C map later in the pipeline. This will be explained in more detail in the next section. There are a few additional parameters that you can use to control the mapping process. Use ``--iterative`` to iterative mapping: Reads are initially trimmed to 25bp before mapping, and then iteratively expanded until a unique, high quality mapping location can be found. This can improve mapping efficiency by a few percent, as smaller reads have a lower likelihood of mismatches due to sequencing errors. ``-s`` or ``--step-size`` controls the size by which reads are extended at every iterative mapping step. As mentioned above, it is common to find reads in Hi-C libraries that contain a ligation junction sequence. FAN-C can automatically split these kinds of reads before mapping using the ``--split-ligation-junction`` option, which can improve mapping efficiency. ``fanc auto`` parallelises mapping by spawning multiple mapping processes internally. This can result in high disk I/O - if you have issues with poor performance, try using the ``--mapper-parallel`` option, which will instead use the multithreading of your chosen mapping software. If you are using Bowtie2, you can additionally use the ``--memory-map`` option, which will load the entire Bowtie2 index into memory to be shared across all Bowtie2 processes. Use this option if your system has a lot of memory available to speed up the mapping. Finally, if you are using the ``-tmp`` option, which causes ``fanc auto`` to perform most pipeline steps in a temporary directory, you may want to use the ``--split-fastq`` option to split the FASTQ files into smaller chunks before mapping, so you can save space on your ``tmp`` partition. The resulting BAM files are automatically handed to the next step in the pipeline, or you can provide SAM/BAM files to ``fanc auto`` directly. This is described in the following section. You can also perform the mapping separately with the ``fanc map`` command, which also gives you additional options for controlling the mapping process, and which is described in :ref:`fanc-map`. ============= SAM/BAM input ============= To process SAM/BAM files, no additional external software is required. However, we do recommend the installation of `Sambamba `_, which can greatly speed up the SAM sorting step required for merging mate pairs into the Pairs object. A minimal ``fanc auto`` command using SAM/BAM files could look like this: .. code:: bash fanc auto output/sam/SRR4271982_chr18_19_1.bam output/sam/SRR4271982_chr18_19_2.bam \ ./example_output/ -g hg19_chr18_19_re_fragments.bed Similarly to FASTQ input, ``fanc auto`` assumes that two consecutive SAM/BAM files represent mate pairs, and will match the read names in the pairing step. The ``-g`` or ``--genome`` parameter is mandatory for both FASTQ and SAM/BAM input, and is used to load (or construct) the restriction fragment regions necessary for building the fragment-level Hi-C object. You can either directly provide a region-based file with restriction fragments (most file formats are supported, including BED and GFF), or use a FASTA file with the genomic sequence in conjunction with the ``-r`` or ``--restriction-enzyme`` parameter. In the latter case, ``fanc auto`` will perform an in silico digestion of the genome and use the resulting restriction fragments from there. .. note:: Genome assembly FASTA files typically contain a large number of unassembled contigs or other sequences that are often irrelevant for downstream Hi-C analysis. As the number of chromosomes can negatively affect FAN-C performance, it is generally a good idea to limit the analysis to canonical chromosomes. A very easy way to do with with ``fanc`` is the ``fanc fragments`` command, which accepts a ``--chromosomes`` option to specify exactly which chromosomes you want in the final analysis. The output file can be directly used as input for the ``-g`` argument. SAM/BAM files are first sorted and then matched by ``qname``. Together with the restriction fragment list, mate pairs will be assigned to restriction fragments and stored in a "Pairs" object. By default, ``fanc auto`` excludes unmappable and multimapping reads, as these are unusable or misleading in interpreting Hi-C data. Additional filters for read pairs are described in the :ref:`fanc-auto-pairs` section. You can run the SAM/BAM to Pairs step of the ``fanc auto`` pipeline separately using ``fanc pairs``, which is described in mor detail in :ref:`fanc-pairs`. ======================= "Valid pairs" txt input ======================= Many tools for processing Hi-C data output "valid pairs" files, which are typically tab-delimited text files that contain read pair information. FAN-C supports valid pairs files from `HiC-Pro `_ and the `4D Nucleome project `_. With ``fanc auto`` you can load them like this .. code:: bash fanc auto test.validPairs ./example_output/ -g hg19_chr18_19_re_fragments.bed ``fanc auto`` will attempt to automatically determine if you supply a valid pairs file. .. _fanc-auto-pairs: =========== Pairs input =========== If you already have a FAN-C Pairs object, for example from a previous ``fanc auto`` run or from the ``fanc pairs`` command, you can feed them to ``fanc auto`` directly: .. code:: bash fanc auto output/pairs/fanc_example.pairs ./example_output/ The Pairs objects already contain restriction fragment information, hence the ``-g`` parameter is no longer necessary. Unless using the ``--no-filter-pairs`` option, ``fanc auto`` will first filter read pairs for self-ligated fragments, PCR duplicates, and restriction site distance (>10kb). You have the option to additionally filter out ligation error products using the ``--le-inward-cutoff`` and ``--le-outward-cutoff`` parameters. More details on the different filtering options are available in the description of the separate ``fanc pairs`` command: :ref:`fanc-pairs` After filtering, Pairs files are converted to fragment-level Hic objects. The parameters applying to their processing are described in the next section. ========= Hic input ========= If you already have a FAN-C Hic object, for example from a previous ``fanc auto`` run or from the ``fanc hic`` command, you can feed them to ``fanc auto`` directly: .. code:: bash fanc auto output/hic/fanc_example.hic ./example_output/ If you are running this command with multiple input files, these will be merged into a single fragment-level Hic object. This merged Hic object will then be binned at the resolutions specified with the ``-b`` parameter. By default, it will produce binned Hic files at 5mb, 2mb, 1mb, 500kb, 250kb, 100kb, 50kb, 25kb, 10kb, and 5kb resolution. Binned Hi-C files will be filtered for coverage (bins with less reads than 10% of the median bin coverage) and corrected using Knight-Ruiz matrix balancing. If you prefer ICE correction, use the ``--norm-method ice`` option, or ``--norm-method vc`` for vanilla coverage normalisation. Each chromosome in the matrix is corrected independently, and by default the corrected matrix entries correspond to contact probabilities. You can use the ``--restore-coverage`` option to force matrix entries in a chromosome to sum up to the total number of reads before correction. You can run the Hi-C processing step independently with the ``fanc hic`` command, as described in detail in :ref:`fanc-hic` =========== Mixed input =========== Now that we have covered all the different input options fort ``fanc auto``, it is worth stressing that you can combined different types of input in the same command. ``fanc auto`` will attempt to automatically determine the commands necessary for each input to run through the entire pipeline, and will merge inputs into a single fragment-level Hic object before binning. That means something like this is possible (although it does not make sense in this particular case): .. code:: bash fanc auto hic/test.hic pairs/test.pairs test.validPairs \ sam/SRR4271982_chr18_19_1.bam sam/SRR4271982_chr18_19_2.bam \ SRR4271982_chr18_19_1.fastq.gzip SRR4271982_chr18_19_2.fastq.gzip \ ./example_output/ -g hg19_chr18_19_re_fragments.bed -b 1mb, 50kb, 25kb \ -i bwa-index/hg19_chr18_19.fa -n test -s 20 -t 16 -q 3 ************************************************** Test runs and Sun/Oracle/Slurm Grid engine support ************************************************** By default, ``fanc auto`` runs tasks in parallel locally on the machine it was started on. If you want to perform a test run, without actually executing any commands, you can use the ``--run-with test`` option. This will not run any of the ``fanc`` pipeline steps, but will print each command it would run, including the dependencies between commands, to the command line. If you have access to a computational cluster running Sun/Oracle Grid Engine (SGE/OGE), you can instruct ``fanc auto`` to submit all commands to the cluster using ``--run-with sge``. Internally, this calls ``qsub`` on each command and uses the ``--hold_jid`` parameter to ensure each command waits for the output of its dependencies. You can configure the SGE setup using :ref:`fanc-config` There is also experimental support for `Slurm `_, which you can enable using ``--run-with slurm``. ********** Next steps ********** Once you have generated your binned, filtered, and corrected Hic objects with ``fanc auto``, you may want to explore the data in those matrices. FAN-C provides a number of commands for data analysis and exploration. Continue with :ref:`analyse_hic` for further details. If you want to explore individual matrix generation pipeline steps, continue to :ref:`fanc-modular`.