Map module
This module performs iterative mapping of reads in FASTQ files to a reference genome.
The main function is iterative_mapping(), which requires an input FASTQ file, an output SAM file, and a suitable mapper. The mappers currently available are Bowtie2Mapper and BwaMapper, but by subclassing Mapper a user can write their own mapper implementation that fully leverages the iterative mapping capabilities of FAN-C. See the code of Bowtie2Mapper for an example.
Example usage:
import fanc
mapper = fanc.BwaMapper("bwa-index/hg19_chr18_19.fa", min_quality=3)
fanc.iterative_mapping("SRR4271982_chr18_19_1.fastq.gzip", "SRR4271982_chr18_19_1.bam",
mapper, threads=4, restriction_enzyme="HindIII")
class fanc.map.Bowtie2Mapper(bowtie2_index, min_quality=30, additional_arguments=(), threads=1, _bowtie2_path='bowtie2', **kwargs)
Bases: fanc.map.Mapper
Bowtie2 mapper for aligning reads against a reference genome.
Implements Mapper by calling the command-line “bowtie2” program.
bowtie2_index
Path to the bowtie2 index of the reference genome of choice.
min_quality
Minimum MAPQ an alignment needs so that it won’t be resubmitted in iterative mapping.
additional_arguments
Arguments passed to the “bowtie2” command in addition to -x, -U, --no-unal, --threads, and -S.
threads
Number of threads for this mapping process.
close()
Final operations after mapping completes.
map(input_file, output_folder=None)
Map reads in the given FASTQ file using the _map() implementation.
Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), then split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file is constructed from the invalid alignments, extending the reads by a given step size; it can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.
Parameters
input_file – Path to FASTQ file
output_folder – (optional) path to temporary folder for SAM output
Returns
tuple, path to valid SAM alignments, path to resubmission FASTQ
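The split step described for map() can be illustrated with a minimal, self-contained sketch. It does not use FAN-C itself: split_alignments and the low_mapq filter are hypothetical names introduced here, and the sketch only collects the names of reads to resubmit rather than writing a resubmission FASTQ file.

```python
def split_alignments(sam_lines, should_resubmit):
    """Partition SAM lines into valid alignments and read names to resubmit.

    should_resubmit is any callable that takes the tab-split SAM fields
    and returns True when the read should be mapped again.
    """
    valid, resubmit_names = [], []
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignment
            continue
        fields = line.rstrip("\n").split("\t")
        if should_resubmit(fields):
            resubmit_names.append(fields[0])  # column 1 is the read name (QNAME)
        else:
            valid.append(line)
    return valid, resubmit_names


def low_mapq(fields):
    # Hypothetical filter: resubmit anything with MAPQ (column 5) below 30.
    return int(fields[4]) < 30


sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr18\t100\t42\t25M\t*\t0\t0\t" + "A" * 25 + "\t" + "I" * 25,
    "read2\t0\tchr18\t500\t3\t25M\t*\t0\t0\t" + "C" * 25 + "\t" + "I" * 25,
]
valid, names = split_alignments(sam, low_mapq)
```

Here read1 is kept as a valid alignment, while read2 falls below the assumed MAPQ threshold and is queued for another round.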
resubmit(sam_fields)
Determine if an alignment should be resubmitted.
Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.
Parameters
sam_fields – The individual fields in a SAM line (split by tab)
Returns
True if the read should be extended and aligned to the reference again, False if the read passes the validity criteria.
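As an illustration of the kind of decision resubmit() makes, the following sketch checks the unmapped bit of the SAM FLAG and then compares MAPQ against a threshold. This is an assumption about how such a filter could look, not FAN-C's actual implementation; should_resubmit is a hypothetical standalone function.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit that marks an unmapped read


def should_resubmit(sam_fields, min_quality=30):
    """Return True if the read should be aligned again.

    sam_fields is a SAM line split by tab: column 2 (index 1) is FLAG,
    column 5 (index 4) is MAPQ.
    """
    flag = int(sam_fields[1])
    if flag & UNMAPPED_FLAG:
        return True  # unmappable reads are always resubmitted
    mapq = int(sam_fields[4])
    return mapq < min_quality  # assumed extra criterion based on MAPQ


unmapped = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII".split("\t")
confident = "read2\t0\tchr18\t100\t42\t4M\t*\t0\t0\tACGT\tIIII".split("\t")
ambiguous = "read3\t16\tchr19\t200\t3\t4M\t*\t0\t0\tACGT\tIIII".split("\t")
```

With the default threshold, only the confidently mapped read2 passes; the other two would go into the next mapping iteration.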
-
class fanc.map.BwaMapper(bwa_index, min_quality=0, additional_arguments=(), threads=1, algorithm='mem', memory_map=False, _bwa_path='bwa')
Bases: fanc.map.Mapper
BWA mapper for aligning reads against a reference genome.
Implements Mapper by calling the command-line “bwa” program.
bwa_index
Path to the BWA index of the reference genome of choice.
min_quality
Minimum MAPQ an alignment needs so that it won’t be resubmitted in iterative mapping.
additional_arguments
Arguments passed to the “bwa” command in addition to -t and -o.
threads
Number of threads for this mapping process.
algorithm
BWA algorithm to use for mapping. Uses “mem” by default. See http://bio-bwa.sourceforge.net/bwa.shtml for other options.
close()
Final operations after mapping completes.
map(input_file, output_folder=None)
Map reads in the given FASTQ file using the _map() implementation.
Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), then split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file is constructed from the invalid alignments, extending the reads by a given step size; it can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.
Parameters
input_file – Path to FASTQ file
output_folder – (optional) path to temporary folder for SAM output
Returns
tuple, path to valid SAM alignments, path to resubmission FASTQ
resubmit(sam_fields)
Determine if an alignment should be resubmitted.
Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.
Parameters
sam_fields – The individual fields in a SAM line (split by tab)
Returns
True if the read should be extended and aligned to the reference again, False if the read passes the validity criteria.
-
class fanc.map.SimpleBowtie2Mapper(bowtie2_index, additional_arguments=(), threads=1, _bowtie2_path='bowtie2')
Bases: fanc.map.Bowtie2Mapper
Bowtie2 mapper for aligning reads against a reference genome without resubmission.
Implements Mapper by calling the command-line “bowtie2” program. Does not resubmit reads under any circumstance.
bowtie2_index
Path to the bowtie2 index of the reference genome of choice.
additional_arguments
Arguments passed to the “bowtie2” command in addition to -x, -U, --no-unal, --threads, and -S.
threads
Number of threads for this mapping process.
close()
Final operations after mapping completes.
map(input_file, output_folder=None)
Map reads in the given FASTQ file using the _map() implementation.
Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), then split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file is constructed from the invalid alignments, extending the reads by a given step size; it can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.
Parameters
input_file – Path to FASTQ file
output_folder – (optional) path to temporary folder for SAM output
Returns
tuple, path to valid SAM alignments, path to resubmission FASTQ
resubmit(sam_fields)
Determine if an alignment should be resubmitted.
Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.
Parameters
sam_fields – The individual fields in a SAM line (split by tab)
Returns
True if the read should be extended and aligned to the reference again, False if the read passes the validity criteria.
-
class fanc.map.SimpleBwaMapper(bwa_index, additional_arguments=(), threads=1, memory_map=False, _bwa_path='bwa')
Bases: fanc.map.BwaMapper
BWA mapper for aligning reads against a reference genome without resubmission.
Implements Mapper by calling the command-line “bwa” program. Does not resubmit reads under any circumstance, i.e. does not perform iterative mapping.
bwa_index
Path to the BWA index of the reference genome of choice.
min_quality
Minimum MAPQ an alignment needs so that it won’t be resubmitted in iterative mapping.
additional_arguments
Arguments passed to the “bwa” command in addition to -t and -o.
threads
Number of threads for this mapping process.
algorithm
BWA algorithm to use for mapping. Uses “mem” by default. See http://bio-bwa.sourceforge.net/bwa.shtml for other options.
close()
Final operations after mapping completes.
map(input_file, output_folder=None)
Map reads in the given FASTQ file using the _map() implementation.
Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), then split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file is constructed from the invalid alignments, extending the reads by a given step size; it can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.
Parameters
input_file – Path to FASTQ file
output_folder – (optional) path to temporary folder for SAM output
Returns
tuple, path to valid SAM alignments, path to resubmission FASTQ
resubmit(sam_fields)
Determine if an alignment should be resubmitted.
Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.
Parameters
sam_fields – The individual fields in a SAM line (split by tab)
Returns
True if the read should be extended and aligned to the reference again, False if the read passes the validity criteria.
-
fanc.map.iterative_mapping(fastq_file, sam_file, mapper, tmp_folder=None, threads=1, min_size=25, step_size=5, batch_size=200000, trim_front=False, restriction_enzyme=None)
Iteratively map sequencing reads using the provided mapper.
Will attempt to align a read using mapper. If unsuccessful, will truncate the read by step_size and attempt to align again. This is repeated until a successful alignment is found or the read gets truncated below min_size.
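The truncate-and-retry loop described above can be sketched in plain Python. This is only an illustration of the control flow under stated assumptions, not FAN-C's code: iterative_align and toy_align are hypothetical, and the "aligner" is an exact substring search against a made-up reference.

```python
def iterative_align(read, try_align, min_size=25, step_size=5, trim_front=False):
    """Shorten a read step by step until try_align reports a hit.

    try_align is any callable returning a result, or None on failure.
    Returns (aligned_sequence, result), or None if no attempt succeeded.
    """
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    current = read
    while len(current) >= min_size:
        hit = try_align(current)
        if hit is not None:
            return current, hit
        # Drop step_size bases from the front or the back and retry.
        current = current[step_size:] if trim_front else current[:-step_size]
    return None


reference = "ACGTTGCAAGGCTTAACCGGATATCGCGTATAGCTAGCTAGGATCC"


def toy_align(seq):
    # Stand-in for a real mapper: exact match position, or None.
    pos = reference.find(seq)
    return pos if pos >= 0 else None


# 30 bases from the reference followed by 10 bases that do not match.
read = reference[5:35] + "T" * 10
result = iterative_align(read, toy_align)
no_hit = iterative_align("T" * 40, toy_align)
```

In this toy case the attempts at 40 and 35 bases fail, and the 30-base prefix aligns. A read that never matches is abandoned once it would fall below min_size.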
Parameters
fastq_file – An input FASTQ file path with reads to align
sam_file – An output file path for sequencing results. If it ends with ‘.bam’, the output is compressed in BAM format.
mapper – An instance of Mapper, e.g. Bowtie2Mapper. Subclass Mapper to create your own custom mappers.
tmp_folder – A temporary folder for outputting subsets of FASTQ files
threads – Number of mapper threads to use in parallel.
min_size – Minimum length of read for which an alignment is attempted.
step_size – Number of base pairs by which to truncate read.
batch_size – Maximum number of reads processed in one batch
trim_front – Trim bases from front of read instead of back
restriction_enzyme – If provided, will calculate the expected ligation junction between reads and split reads accordingly. Mapping is then attempted for both ends. Can be the name of a restriction enzyme or a restriction pattern (e.g. A^AGCT_T)
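To make the restriction pattern notation concrete, here is a hedged sketch of how a ligation junction could be derived from a pattern such as A^AGCT_T and used to split a read. It assumes a 5' overhang (the ^ cut lies before the _ cut) with filled-in ends; ligation_junction and split_at_junction are hypothetical helpers, and how FAN-C computes and applies the junction internally may differ.

```python
def ligation_junction(pattern):
    """Expected junction sequence for a pattern with '^' and '_' cut marks.

    '^' marks the forward-strand cut and '_' the reverse-strand cut.
    Assumes both marks are present and that '^' comes first (5' overhang).
    """
    forward_cut = pattern.index("^")
    reverse_cut = pattern.index("_") - 1  # shift for the '^' before it
    if forward_cut > reverse_cut:
        raise ValueError("only 5' overhang patterns are handled in this sketch")
    site = pattern.replace("^", "").replace("_", "")
    # After fill-in, the left fragment ends at the reverse cut and the
    # right fragment starts at the forward cut.
    return site[:reverse_cut] + site[forward_cut:]


def split_at_junction(read, junction):
    """Split a read in the middle of the first junction occurrence."""
    pos = read.find(junction)
    if pos < 0:
        return [read]  # no junction found, keep the read whole
    cut = pos + len(junction) // 2
    return [read[:cut], read[cut:]]


hindiii = ligation_junction("A^AGCT_T")
mboi = ligation_junction("^GATC_")
parts = split_at_junction("CCCC" + hindiii + "GGGG", hindiii)
```

For the HindIII pattern this yields the junction AAGCTAGCTT, and a read containing it is split into two ends that can each be mapped separately.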