Generating and filtering read pairs¶
Once we have mapped and sorted BAM files (see Mapping FASTQ files) we can load the paired reads
into a ReadPairs
object.
Obtaining fragments¶
ReadPairs
map reads to restriction fragments, so first of all we are
going to need a list of those. The fanc.regions.genome_regions()
function has been
made for that purpose. If you provide it with a FASTA file and a restriction enzyme name,
it will build the list of fragments for you:
from fanc.regions import genome_regions
genome_file = 'hg19_chr18_19.fa'
fragments = genome_regions(genome_file, restriction_enzyme='HindIII')
You can also give it a RegionBased
compatible file with fragments
directly (BED, GFF and similar work). Note that the fragments should cover the entire genome,
so don’t just use a list of restriction sites!
The returned object is of class RegionBased
. See
the main article
if you want to learn how to interact with objects of that class.
Setting up filters at the read level¶
Filters can be used to remove problematic reads, such as those with poor quality
(QualityFilter
) or with multiple mapping locations throughout
the genome (UniquenessFilter
). These filters (of class
ReadFilter
) have to be added already at the read pair import step,
as the necessary information is otherwise lost in later processing stages:
from fanc.pairs import QualityFilter, UniquenessFilter
quality_filter = QualityFilter(30, mask='MAPQ')
uniqueness_filter = UniquenessFilter(strict=True, mask='unique')
Warning
Both filters are specific to reads mapped with bowtie2
, and there are versions of
these filters specific for BWA-mapped reads (BwaMemQualityFilter
and BwaMemUniquenessFilter
). Make sure you use the appropriate
filter for your chosen mapper, as the quality and uniqueness definitions used by each
application differ substantially!
The QualityFilter
removes all read pairs where one or both reads have
a MAPQ value lower than 30
(in this case). The UniquenessFilter
has two settings. If strict=True
it will remove any read pair where either of the two
reads has the XS
tag, which indicates the presence of a secondary alignment. If
strict=False
, it further requires that the secondary alignment score is higher or equal
to the primary alignment score (AS
). The latter setting is therefore more permissive.
Now we have set up the necessary filters, we can start importing read pairs and match them to restriction fragments.
Importing and accessing read pairs¶
For importing read pairs from SAM files we can use generate_pairs_split()
.
It requires the two paired-end SAM (or BAM) files and fragment definitions. In addition, we
provide it with the read filters we created above, so that only reads meeting our
requirements are imported into the object. If we do not pass a file name with output_file
,
the resulting object is created in memory, which is useful for testing, but highly inadvisable
for actual Hi-C libraries. Here, we request four threads to be used for the loading in parallel:
from fanc.pairs import generate_pairs_split as generate_pairs
pairs_folder = mkdir(os.path.join(output_folder, 'pairs'))
pairs = generate_pairs(sam_1_file, sam_2_file, fragments,
read_filters=(quality_filter, uniqueness_filter),
output_file=os.path.join(pairs_folder, 'example.pairs'),
check_sorted=True, threads=4)
The resulting object is of the type ReadPairs
, which implements
RegionPairsContainer. It contains all valid read pairs matched to their respective
restriction fragments.
Let’s look at some basic information:
chromosomes = pairs.chromosomes()
print(chromosomes)
# ['chr18', 'chr19']
pair = pairs[0]
print(pair)
# chr18: 3187827-(3191583[1])-3192106 -- chr18: 3187827-(3192073[-1])-3192106
type(pair)
# fanc.pairs.FragmentReadPair
print(pair.left)
# chr18: 3187827-(3191583[1])-3192106
print(pair.right)
# chr18: 3187827-(3192073[-1])-3192106
type(pair.right)
# fanc.pairs.FragmentRead
print(pair.right.fragment)
# chr18:3187827-3192106
type(pair.right.fragment)
# genomic_regions.regions.GenomicRegion
print(pair.right.position)
# 3192073
print(pair.right.re_distance())
# 33
print(pair.right.strand)
# -1
As you can see, ReadPairs
is a container for
FragmentReadPair
objects. Each object contains information about its
two associated reads in the left
and
right
properties, respectively, which return a
FragmentRead
object.
FragmentRead
objects contain information about the read’s mapping
location (pair.right.position
and pair.right.strand
), and the fragment it maps to
(pair.right.fragment
). For convenience, you can also directly calculate the distance of
the read to the end of the fragment (which is the nearest restriction site) using
pair.right.re_distance()
. The pair.right.fragment
attribute is of type
GenomicRegion
, which has lots of useful properties which
you can find in the genomic_regions documentation.
Incidentally, the above pair is a self-ligated” fragment, which will be filtered out
in post-processing.
Typically you would not access each pair individually, but iterate over all pairs, or pair
subsets with the fanc.pairs.ReadPairs.pairs()
function:
# all pairs
for pair in pairs.pairs:
print(pair)
# pair subset (only pairs with both fragments on chromosome 18)
for pair in pairs.pairs(('chr18', 'chr18')):
print(pair)
The fanc.pairs.ReadPairs.pairs()
function supports lazy evaluation
with the lazy=True
argument, but make sure to read the caveats in the main article!
Read/fragment pair filters¶
In addition to the quality and uniqueness filters, which had to be used during read pair import,
FAN-C offers a number of FragmentReadPairFilter
to remove problematic pairs.
First of all, however, we’ll introduce one little function that you can use to reset all filters
(and restore the original pairs) in case you found your filtering t be too harsh or if you want
a fresh start, which is called reset_filters()
:
pairs.reset_filters()
We will illustrate how filters are used with the SelfLigationFilter
.
from fanc.pairs import SelfLigationFilter
sl_filter = SelfLigationFilter(mask='self-ligation')
# alternative:
# from fanc.general import Mask
# sl_filter = SelfLigationFilter(mask=Mask(name='self-ligation', description="Filter for self-ligated fragments")
pairs.filter(sl_filter)
In general, we first create a filter instance. The mask
parameter allows us to specify a
filter name. We can also pass it a more elaborate Mask
object with
properties for name and description (see commented code). Name and description of the mask will
be stored in the pairs
object upon filtering. The filter is run simply by calling
filter()
with the corresponding filter instance.
If you want to run multiple filters, it is more efficient to “queue” filters and then execute them all in one go:
from fanc.pairs import SelfLigationFilter, ReDistanceFilter
sl_filter = SelfLigationFilter(mask='self-ligation')
rd_filter = ReDistanceFilter(500, mask='re-site-distance')
pairs.filter(sl_filter, queue=True)
pairs.filter(rd_filter, queue=True)
pairs.run_queued_filters()
Once the filter()
command has completed, the filtered pairs will
appear to have been deleted from the object.
pair = pairs[0]
print(pair)
# chr18: 899140-(899308[-1])-899476 -- chr18: 1509911-(1510021[1])-1510076
In reality, filtering only masks (“hides”) edges, so we have the opportunity to reset filters or selectively disable them when iterating:
for pair in pairs.pairs(excluded_filters=['self-ligation']):
print(pair)
To obtain a dictionary with the filter statistics, use filter_statistics()
statistics = pairs.filter_statistics()
Filter types¶
Here are the filters available in FAN-C for read/fragment pairs:
SelfLigationFilter
removes all pairs where both reads map to the same fragmentReDistanceFilter
can filter for an expected DNA (not restriction) fragment size in Hi-C libraries that result from fragmentation (typically sonication). It sums up the distance of both reads to their nearest restriction site. If that sum exceeds the cutoff set at filter creation, it will be marked as invalidPCRDuplicateFilter
will find pairs mapping to the exact same genomic locations (both reads), and will only retain one copy of the exact duplicates.InwardPairsFilter
andOutwardPairsFilter
are removing pairs where both reads map within a specific distance on the same chromosome and are oriented towards or away from each other (in terms of strand), respectively. This is designed to remove ligation products that have likely arisen from uncut restriction sites or that are unligated.
Next, we will convert the filtered pairs to a Hic
object.