Pairs module¶
-
class
fanc.pairs.
BwaMemQualityFilter
(cutoff=0.9, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filters
bwa mem
generated alignments based on the alignment score (normalized by the length of the alignment).-
valid_read
(read)¶ Check if a read has a high alignment score.
-
-
class
fanc.pairs.
BwaMemUniquenessFilter
(strict=False, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filters bwa mem generated alignments based on whether they are unique or not.
-
valid_read
(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
- Parameters
read – A
Read
object- Returns
True if Read is valid, False otherwise
-
-
class
fanc.pairs.
ContaminantFilter
(contaminant_reads, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that also map to a contaminant genome.
-
valid_read
(read)¶ Check if a read also maps to a contaminant
-
-
class
fanc.pairs.
FourDNucleomePairGenerator
(pairs_file)¶ Bases:
fanc.pairs.TxtReadPairGenerator
Read pair generator that works on 4D Nucleome “.pairs” files.
For details on the 4D Nucleome pairs format see: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
FragmentRead
(fragment=None, position=None, strand=0)¶ Bases:
object
Class representing a fragment-mapped read.
-
fragment
¶ A
GenomicRegion
delineated by restriction sites.
-
position
¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand
¶ The strand this read maps to (-1 or +1).
-
re_distance
()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.
FragmentReadPair
(left_read, right_read, ix=None)¶ Bases:
object
Container for two paired
FragmentRead
objects.-
get_gap_size
()¶ Get the gap size in base pairs between the fragments these reads map to.
- Returns
0 if reads map to the same fragment or neighboring fragments, the distance between fragments if they are on the same chromosome, None otherwise
-
is_inward_pair
()¶ Check if reads form an “inward-facing” pair.
A pair is considered inward-facing if the left read maps to the plus the right read to the minus strand and both reads map to the same chromosome.
- Returns
True if reads are inward-facing, False otherwise
-
is_outward_pair
()¶ Check if reads form an “outward-facing” pair.
A pair is considered outward-facing if the left read maps to the minus the right read to the plus strand and both reads map to the same chromosome.
- Returns
True if reads are outward-facing, False otherwise
-
is_same_chromosome
()¶ Check if both reads are mapped to the same chromosome.
- Returns
True is reads map to the same chromosome, False otherwise
-
is_same_fragment
()¶ Check if reads map to the same fragment.
- Returns
True if reads map to the same fragment, False otherwise.
-
is_same_pair
()¶ Check if reads face in the same direction.
- Returns
True if reads are facing in the same direction, False otherwise.
-
-
class
fanc.pairs.
FragmentReadPairFilter
(mask=None)¶ Bases:
fanc.general.MaskFilter
Abstract class that provides filtering functionality for the
FragmentReadPair
object.Extends MaskFilter and overrides valid(self, read) to make
FragmentReadPair
filtering more “natural”.To create custom filters for the
FragmentMappedReadPairs
object, extend this class and override thevalid_pair()
method. valid_pair should return False for a specificFragmentReadPair
object if the object is supposed to be filtered/masked and True otherwise. SeeInwardPairsFilter
for an example.Pass a custom filter to the filter method in
FragmentMappedReadPairs
to apply it.-
valid
(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.
HicProPairGenerator
(file_name)¶ Bases:
fanc.pairs.TxtReadPairGenerator
Read pair generator for HiC-Pro “validPairs” files.
This generator is a subclass of
TxtReadPairGenerator
with presets for fields in HiC-Pro validPairs files.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
InwardPairsFilter
(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filter inward-facing read pairs at a distance less than a specified cutoff.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if a pair is inward-facing and <minimum_distance> apart.
-
-
class
fanc.pairs.
LazyFragment
(row, pairs, ix=None, side='left')¶ Bases:
genomic_regions.regions.GenomicRegion
GenomicRegion
representing a fragment with lazy attribute loading.-
chromosome
¶ The reference sequence this region is located on
-
start
¶ start position of this region on the reference (1-based, inclusive)
-
end
¶ end position of this region on the reference (1-based, inclusive)
-
strand
¶ strand of the reference this region is located on (-1, 1, 0, or None)
-
ix
¶ Region index within the context of regions from the same object.
-
as_bed_line
(score_field='score', name_field='name')¶ Return a representation of this object as line in a BED file.
- Parameters
score_field – name of the attribute to be used in the ‘score’ field of the BED line
name_field – name of the attribute to be used in the ‘name’ field of the BED line
- Returns
str
-
as_gff_line
(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶ Return a representation of this object as line in a GFF file.
- Parameters
source_field – name of the attribute to be used in the ‘source’ field of the GFF line
feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
score_field – name of the attribute to be used in the ‘score’ field of the GFF line
frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
float_format – Formatting string for the float fields
- Returns
str
-
property
attributes
¶ Return all visible attributes of this
GenomicRegion
.Returns all attribute names that do not start with an underscore. :return: list of attribute names
-
property
center
¶ Return the center coordinate of the
GenomicRegion
.- Returns
float
-
contains
(region)¶ Check if the specified region is completely contained in this region.
- Parameters
region –
GenomicRegion
object or string
-
copy
()¶ Return a (shallow) copy of this
GenomicRegion
- Returns
GenomicRegion
-
expand
(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶ Expand this region by a relative or an absolute amount.
- Parameters
absolute – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>relative – Relative amount as fraction of region by which to expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>absolute_left – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the left sideabsolute_right – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the right siderelative_left – Relative amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the left siderelative_right – Relative amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the right sidecopy – If True, return a copy of the original region, if False will modify the existing region in place
from_center – If True measures distance from center rather than start and end of the old region
- Returns
GenomicRegion
-
property
five_prime
¶ Return the position of the 5’ end of this
GenomicRegion
on the reference.- Returns
int
-
fix_chromosome
(copy=False)¶ Change chromosome representation from chr<NN> to <NN> or vice versa.
- Parameters
copy – If True, make copy of region, otherwise will modify existing region in place.
- Returns
GenomicRegion
-
classmethod
from_string
(region_string)¶ Convert a string into a
GenomicRegion
.This is a very useful convenience function to quickly define a
GenomicRegion
object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.- Parameters
region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
- Returns
GenomicRegion
-
is_forward
()¶ Return True if this region is on the forward strand of the reference genome.
- Returns
True if on ‘+’ strand, False otherwise.
-
is_reverse
()¶ Return True if this region is on the reverse strand of the reference genome.
- Returns
True if on ‘-‘ strand, False otherwise.
-
overlap
(region)¶ Return the overlap in base pairs between this region and another region.
- Parameters
region –
GenomicRegion
to find overlap for- Returns
overlap as int in base pairs
-
overlaps
(region)¶ Check if this region overlaps with the specified region.
- Parameters
region –
GenomicRegion
object or string
-
set_attribute
(attribute, value)¶ Safely set an attribute on the
GenomicRegion
object.This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.
- Parameters
attribute – Name of the attribute to be set
value – Value of the attribute to be set
-
property
strand_string
¶ Return the ‘strand’ attribute as string.
- Returns
strand as str (‘+’, ‘-‘, or ‘.’)
-
property
three_prime
¶ Return the position of the 3’ end of this
GenomicRegion
on the reference.- Returns
int
-
to_string
()¶ Convert this
GenomicRegion
to its string representation.- Returns
str
-
-
class
fanc.pairs.
LazyFragmentRead
(row, pairs, side='left')¶ Bases:
fanc.pairs.FragmentRead
FragmentRead
implementation with lazy attribute loading.-
fragment
¶ A
GenomicRegion
delineated by restriction sites.
-
position
¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand
¶ The strand this read maps to (-1 or +1).
-
re_distance
()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.
MinimalRead
(chromosome, position, strand)¶ Bases:
object
Minimal class representing an aligned read.
-
chromosome
¶ Chromosome name
-
reference_name
¶ Identical to chromosome, exists for compatibility with pysam
-
position
¶ Position of aligned read on chromosome
-
pos
¶ Identical to position, exists for compatibility with Pysam
-
strand
¶ Strand of the alignment
-
flag
¶ Flag representing the strandedness of a read. 0 if on forward strand, else -1
-
-
class
fanc.pairs.
Monitor
(value=0, manager=None)¶ Bases:
fanc.tools.general.WorkerMonitor
Class to monitor fragment info worker threads.
-
is_generating_pairs
()¶ Check the pair generating status.
-
set_generating_pairs
(value)¶ Set the pair generating status.
-
-
class
fanc.pairs.
OutwardPairsFilter
(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filter outward-facing read pairs at a distance less than a specified cutoff.
-
valid
(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.
PCRDuplicateFilter
(pairs, threshold=2, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Masks alignments that are suspected to be PCR duplicates. In order to be considered duplicates, two pairs need to have identical start positions of their respective left alignments AND of their right alignments.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if a pair is duplicated.
-
-
class
fanc.pairs.
PairedSamBamReadPairGenerator
(sam_file)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from single SAM/BAM files from reads with identical qname.
Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
static
resolve_chimeric
(reads, max_dist_same_locus=500)¶ - Returns
read1, read2, is_chimeric
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
QualityFilter
(cutoff=30, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter mapped reads based on mapping quality.
-
valid_read
(read)¶ Check if a read has a mapq >= cutoff.
-
-
class
fanc.pairs.
ReDistanceFilter
(maximum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filters read pairs where the sum of distances of each read to its nearest restriction site is larger than a certain distance.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check sum of restriction site distances < maximum_distance.
-
-
class
fanc.pairs.
ReadFilter
(mask=None)¶ Bases:
object
Abstract class that provides filtering functionality for
MinimalRead
,AlignedSegment
or compatible.To create a custom
ReadFilter
, extend this class and override the valid_read(self, read) method. valid_read should return False for a specific read object if the object is supposed to be filtered/masked and True otherwise. SeeQualityFilter
for an example.Pass a custom filter to the filter method in
ReadPairGenerator
to apply it.-
valid_read
(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
- Parameters
read – A
Read
object- Returns
True if Read is valid, False otherwise
-
-
class
fanc.pairs.
ReadPairGenerator
¶ Bases:
object
Base class for generating and filtering read pairs.
This class primarily provides filtering capabilities for read pairs generated by subclasses of
ReadPairGenerator
. You can add aReadFilter
usingadd_filter()
, which will be used during read pair generation. Filtering statistics are collected during the run, and can be obtained viastats()
.These generators are primarily meant as input for
add_read_pairs()
, but can also be used independently.Subclasses of
ReadPairGenerator
must implement the_iter_read_pairs()
function.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
ReadPairs
(file_name=None, mode='a', _group_name='fragment_map', _table_name_fragments='fragments', _table_name_pairs='pairs', tmpdir=None)¶ Bases:
fanc.matrix.RegionPairsTable
Class representing a collection of read pairs mapped to restriction fragments.
This class is a
RegionBased
object, where eachGenomicRegion
represents a restriction fragment from a Hi-C experiment. A list of fragments can be obtained with thegenome_regions()
function, for example.To create a
ReadPairs
object, you first have to add the restriction fragments before adding read pairs:import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions)
Read pairs can easily be generate fro different types of input using
ReadPairGenerator
implementations, e.g.HicProReadPairGenerator
orSamBamReadPairGenerator
.rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
You can query regions using the
RegionBased
interface:chr1_fragments = rp.regions("chr1")
and you can iterate over read pairs using the
pairs()
:you can also use the :cls:Region
-
class
ChromosomeDescription
¶ Bases:
tables.description.IsDescription
Description of the chromosomes in this object.
-
class
MaskDescription
¶ Bases:
tables.description.IsDescription
-
class
RegionDescription
¶ Bases:
tables.description.IsDescription
Description of a genomic region for PyTables Table
-
add_contact
(contact, *args, **kwargs)¶ Alias for
add_edge()
- Parameters
contact –
Edge
args – Positional arguments passed to
_add_edge()
kwargs – Keyword arguments passed to
_add_edge()
-
add_contacts
(contacts, *args, **kwargs)¶ Alias for
add_edges()
-
add_edge
(edge, check_nodes_exist=True, *args, **kwargs)¶ Add an edge / contact between two regions to this object.
- Parameters
edge –
Edge
, dict with at least the attributes source and sink, optionally weight, or a list of length 2 (source, sink) or 3 (source, sink, weight).check_nodes_exist – Make sure that there are nodes that match source and sink indexes
args – Positional arguments passed to
_add_edge()
kwargs – Keyword arguments passed to
_add_edge()
-
add_edge_from_dict
(edge, *args, **kwargs)¶ Direct method to add an edge from dict input.
- Parameters
edge – dict with at least the keys “source” and “sink”. Additional keys will be loaded as edge attributes
-
add_edge_from_edge
(edge, *args, **kwargs)¶ Direct method to add an edge from
Edge
input.- Parameters
edge –
Edge
-
add_edge_from_list
(edge, *args, **kwargs)¶ Direct method to add an edge from list or tuple input.
- Parameters
edge – List or tuple. Should be of length 2 (source, sink) or 3 (source, sink, weight)
-
add_edge_simple
(source, sink, weight=None, *args, **kwargs)¶ Direct method to add an edge from
Edge
input.- Parameters
source – Source region index
sink – Sink region index
weight – Weight of the edge
-
add_edges
(edges, flush=True, *args, **kwargs)¶ Bulk-add edges from a list.
List items can be any of the supported edge types, list, tuple, dict, or
Edge
. Repeatedly callsadd_edge()
, so may be inefficient for large amounts of data.- Parameters
edges – List (or iterator) of edges. See
add_edge()
for details
-
add_mask_description
(name, description)¶ Add a mask description to the _mask table and return its ID.
- Parameters
name (str) – name of the mask
description (str) – description of the mask
- Returns
id of the mask
- Return type
int
-
add_read_pairs
(read_pairs, batch_size=1000000, threads=1)¶ Add read pairs to this object.
This function requires tuples of read pairs as input, for example from
MinimalRead
orAlignedSegment
. Typically, you won’t have to construct these from scratch, but can use one of theReadPairGenerator
classes to generate read pairs from input files.import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions) # read pairs are added here from BAM files rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
- Parameters
read_pairs – iterator over tuples of read pairs. Typically instances of
ReadPairGenerator
batch_size – Batch size of read pairs sent to fragment info workers
threads – Number of threads for simultaneous fragment info finding
-
add_region
(region, *args, **kwargs)¶ Add a genomic region to this object.
This method offers some flexibility in the types of objects that can be loaded. See parameters for details.
- Parameters
region – Can be a
GenomicRegion
, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
-
add_regions
(regions, *args, **kwargs)¶ Bulk insert multiple genomic regions.
- Parameters
regions – List (or any iterator) with objects that describe a genomic region. See
add_region
for options.
-
static
bin_intervals
(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into a fixed number of bins.
- Parameters
intervals – iterator of tuples (start, end, score)
bins – Number of bins to divide the region into
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
-
static
bin_intervals_equidistant
(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into bins with a fixed size.
- Parameters
intervals – iterator of tuples (start, end, score)
bin_size – Size of each bin in base pairs
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
-
property
bin_size
¶ Return the length of the first region in the dataset.
Assumes all bins have equal size.
- Returns
int
-
binned_regions
(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Same as region_intervals, but returns
GenomicRegion
objects instead of tuples.- Parameters
region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of
GenomicRegion
objects
-
bins_to_distance
(bins)¶ Convert fraction of bins to base pairs
- Parameters
bins – float, fraction of bins
- Returns
int, base pairs
-
property
chromosome_bins
¶ Returns a dictionary of chromosomes and the start and end index of the bins they cover.
Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.
-
property
chromosome_lengths
¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes
()¶ List all chromosomes in this regions table. :return: list of chromosome names.
-
close
(copy_tmp=True, remove_tmp=True)¶ Close this HDF5 file and run exit operations.
If file was opened with tmpdir in read-only mode: close file and delete temporary copy.
If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.
- Parameters
copy_tmp – If False, does not overwrite original with modified file.
remove_tmp – If False, does not delete temporary copy of file.
-
distance_to_bins
(distance)¶ Convert base pairs to fraction of bins.
- Parameters
distance – distance in base pairs
- Returns
float, distance as fraction of bin size
-
downsample
(n, file_name=None)¶ Sample edges from this object.
Sampling is always done on uncorrected Hi-C matrices.
- Parameters
n – Sample size or reference object. If n < 1 will be interpreted as a fraction of total reads in this object.
file_name – Output file name for down-sampled object.
- Returns
RegionPairsTable
-
edge_data
(attribute, *args, **kwargs)¶ Iterate over specific edge attribute.
- Parameters
attribute – Name of the attribute, e.g. “weight”
args – Positional arguments passed to
edges()
kwargs – Keyword arguments passed to
edges()
- Returns
iterator over edge attribute
-
edge_subset
(key=None, *args, **kwargs)¶ Get a subset of edges.
This is an alias for
edges()
.- Returns
generator (
Edge
)
-
property
edges
¶ Iterate over contacts / edges.
edges()
is the central function ofRegionPairsContainer
. Here, we will use theHic
implementation for demonstration purposes, but the usage is exactly the same for all compatible objects implementingRegionPairsContainer
, includingJuicerHic
andCoolerHic
.import fanc # file from FAN-C examples hic = fanc.load("output/hic/binned/fanc_example_1mb.hic")
We can easily find the number of edges in the sample
Hic
object:len(hic.edges) # 8695
When used in an iterator context,
edges()
iterates over all edges in theRegionPairsContainer
:for edge in hic.edges: # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Calling
edges()
as a method has the same effect:# note the '()' for edge in hic.edges(): # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Rather than iterate over all edges in the object, we can select only a subset. If the key is a string or a
GenomicRegion
, all non-zero edges connecting the region described by the key to any other region are returned. If the key is a tuple of strings orGenomicRegion
, only edges between the two regions are returned.# select all edges between chromosome 19 # and any other region: for edge in hic.edges("chr19"): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ... # select all edges that are only on # chromosome 19 for edge in hic.edges(('chr19', 'chr19')): print(edge) # 90--116; bias: 0.00021173151730025176; sink_node: chr19:37000001-38000000; source_node: chr19:11000001-12000000; weight: 0.009104455243910825 # 135--135; bias: 0.00018003890596887822; sink_node: chr19:56000001-57000000; source_node: chr19:56000001-57000000; weight: 0.10028167062466517 # 123--123; bias: 0.00011063368998965993; sink_node: chr19:44000001-45000000; source_node: chr19:44000001-45000000; weight: 0.1386240135570439 # 92--93; bias: 0.00040851066434864896; sink_node: chr19:14000001-15000000; source_node: chr19:13000001-14000000; weight: 0.10090213409411629 # ... # select inter-chromosomal edges # between chromosomes 18 and 19 for edge in hic.edges(('chr18', 'chr19')): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ...
By default,
edges()
will retrieve all edge attributes, which can be slow when iterating over a lot of edges. This is why all file-based FAN-CRegionPairsContainer
objects support lazy loading, where attributes are only read on demand.for edge in hic.edges('chr18', lazy=True): print(edge.source, edge.sink, edge.weight, edge) # 42 42 0.12291311562018173 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #0> # 24 28 0.025205961072838057 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #1> # 5 76 0.00961709840049876 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #2> # 66 68 0.03876763316345468 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #3> # ...
Warning
The lazy iterator reuses the
LazyEdge
object in every iteration, and overwrites theLazyEdge
attributes. Therefore do not use lazy iterators if you need to store edge objects for later access. For example, the following code works as expectedlist(hic.edges())
, with allEdge
objects stored in the list, while this codelist(hic.edges(lazy=True))
will result in a list of identicalLazyEdge
objects. Always ensure you do all edge processing in the loop when working with lazy iterators!When working with normalised contact frequencies, such as obtained through matrix balancing in the example above,
edges()
automatically returns normalised edge weights. In addition, thebias
attribute will (typically) have a value different from 1.When you are interested in the raw contact frequency, use the
norm=False
parameter:for edge in hic.edges('chr18', lazy=True, norm=False): print(edge.source, edge.sink, edge.weight) # 42 42 2120.0 # 24 28 388.0 # 5 76 94.0 # 66 68 470.0 # ...
You can also choose to omit all intra- or inter-chromosomal edges using
intra_chromosomal=False
orinter_chromosomal=False
, respectively.- Returns
Iterator over
Edge
or equivalent.
-
edges_dict
(*args, **kwargs)¶ Edges iterator with access by bracket notation.
This iterator always returns unnormalised edges.
- Returns
dict or dict-like iterator
-
filter
(pair_filter, queue=False, log_progress=True)¶ Apply a
FragmentReadPairFilter
to the read pairs in this object.- Parameters
pair_filter –
FragmentReadPairFilter
queue – If True, does not do the filtering immediately, but queues this filter. All queued filters can then be run at the same time using
run_queued_filters()
log_progress –
- Returns
-
filter_inward
(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
InwardPairsFilter
.- Parameters
minimum_distance – Minimum distance inward-facing read pairs must have to pass this filter
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
kwargs – Additional arguments to pass to
get_ligation_structure_biases()
-
filter_ligation_products
(inward_threshold=None, outward_threshold=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilter
and anInwardPairsFilter
.- Parameters
inward_threshold – Minimum distance inward-facing read pairs must have to pass this filter. If None, will be inferred from the data
outward_threshold – Minimum distance outward-facing read pairs must have to pass this filter. If None, will be inferred from the data
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
kwargs – Additional arguments to pass to
get_ligation_structure_biases()
-
filter_outward
(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilter
.- Parameters
minimum_distance – Minimum distance outward-facing read pairs must have to pass this filter
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
kwargs – Additional arguments to pass to
get_ligation_structure_biases()
-
filter_pcr_duplicates
(threshold=3, queue=False)¶ Convenience function that applies an
PCRDuplicateFilter
.- Parameters
threshold – If distance between two alignments is smaller or equal the threshold, the alignments are considered to be starting at the same position
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_re_dist
(maximum_distance, queue=False)¶ Convenience function that applies an
ReDistanceFilter
.- Parameters
maximum_distance – Maximum distance a read can have to the nearest region border (=restriction site)
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_self_ligated
(queue=False)¶ Convenience function that applies an
SelfLigationFilter
.- Parameters
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_statistics
()¶ Get filtering statistics for this object. :return: dict with {filter_type: count, …}
-
find_region
(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶ Find the region that is at the center of a region.
- Parameters
query_regions – Region selector string, :class:~GenomicRegion, or list of the former
- Returns
index (or list of indexes) of the region at the center of the query region
-
flush
(silent=False)¶ Write buffered data to file and update indexes,
- Parameters
silent – If True, does not use progressbars.
-
get_edge
(item, *row_conversion_args, **row_conversion_kwargs)¶ Get an edge by index.
- Parameters
row_conversion_args – Arguments passed to
RegionPairs._row_to_edge()
row_conversion_kwargs – Keyword arguments passed to
RegionPairs._row_to_edge()
- Returns
Edge
-
get_ligation_structure_biases
(sampling=None, skip_self_ligations=True, **kwargs)¶ Compute the ligation biases (inward and outward to same-strand) of this data set.
- Parameters
sampling – Approximate number of data points to average per point in the plot. If None (default), this will be determined on a best-guess basis.
skip_self_ligations – If True (default), will not consider self-ligated fragments for assessing the error rates.
unfiltered – If True, uses all read pairs, even those that do not pass filters, for the ligation error calculation
- Returns
tuple with (list of gap sizes between reads, list of matching le type ratios)
-
get_mask
(key)¶ Search _mask table for key and return Mask.
- Parameters
key (int) – search by mask name
key – search by mask ID
- Returns
Mask
-
get_masks
(ix)¶ Extract mask IDs encoded in parameter and return masks.
IDs are powers of 2, so a single int field in the table can hold multiple masks by simply adding up the IDs. Similar principle to UNIX chmod (although that uses base 8)
- Parameters
ix (int) – integer that is the sum of powers of 2. Note that this value is not necessarily itself a power of 2.
- Returns
list of Masks extracted from ix
- Return type
list (Mask)
-
intervals
(*args, **kwargs)¶ Alias for region_intervals.
-
mappable
(region=None)¶ Get the mappability of regions in this object.
A “mappable” region has at least one contact to another region in the genome.
- Returns
array
where True means mappable and False unmappable
-
classmethod
merge
(pairs, *args, **kwargs)¶ Merge two or more
RegionPairsTable
objects.- Parameters
pairs – list of
RegionPairsTable
- Returns
merged
RegionPairsTable
-
pairs
(key=None, lazy=False, *args, **kwargs)¶ Iterate over the
FragmentReadPair
objects.- Parameters
key – Region string of the form <chromosome>[:<start>-<end>],
GenomicRegion
or tuples thereoflazy – If True, use lazy loading of objects and their attributes. Much faster, but can lead to unexpected results if one is not careful. For example, this:
list(object.pairs())
is not the same aslist(object.pairs(lazy=True))
! In the latter case, all objects in the list will be identical due to the lazy loading process. Only use lazy loading to access attributes in an iterator!args – Positional arguments passed to
edges_dict()
kwargs – Keyword arguments passed to
edges_dict()
- Returns
-
pairs_by_chromosomes
(chromosome1, chromosome2, **kwargs)¶ Only iterate over read pairs in this combination of chromosomes. :param chromosome1: Name of first chromosome :param chromosome2: Name of second chromosome :param kwargs: Keyword arguments passed to
pairs()
:return:
-
region_bins
(*args, **kwargs)¶ Return slice of start and end indices spanned by a region.
- Parameters
args – provide a
GenomicRegion
here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.- Returns
-
region_data
(key, value=None)¶ Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced
- Parameters
key – Name of the data column
value – vector with region-based data (one entry per region)
-
region_intervals
(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶ Return equally-sized genomic intervals and associated scores.
Use either bins or bin_size argument to control binning.
- Parameters
region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of tuples: (start, end, score)
-
region_subset
(region, *args, **kwargs)¶ Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.
- Parameters
region – String or class:~GenomicRegion object for which covered bins will be returned.
-
property
regions
¶ Iterate over genomic regions in this object.
Will return a
GenomicRegion
object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.- Returns
RegionIter
-
regions_and_edges
(key, *args, **kwargs)¶ Convenient access to regions and edges selected by key.
- Parameters
key – Edge selector, see
edges()
args – Positional arguments passed to
edges()
kwargs – Keyword arguments passed to
edges()
- Returns
list of row regions, list of col regions, iterator over edges
-
property
regions_dict
¶ Return a dictionary with region index as keys and regions as values.
- Returns
dict {region.ix: region, …}
-
static
regions_identical
(pairs)¶ Check if the regions in all objects in the list are identical.
- Parameters
pairs –
list
ofRegionBased
objects- Returns
True if chromosome, start, and end are identical between all regions in the same list positions.
-
run_queued_filters
(log_progress=True)¶ Run queued filters. See
filter()
- Parameters
log_progress – If true, process iterating through all edges will be continuously reported.
-
subset
(*regions, **kwargs)¶ Subset a Hic object by specifying one or more subset regions.
- Parameters
regions – string or GenomicRegion object(s)
kwargs – Supports file_name: destination file name of subset Hic object; tmpdir: if True works in tmp until object is closed additional parameters are passed to
edges()
- Returns
Hic
-
to_bed
(file_name, subset=None, **kwargs)¶ Export regions as BED file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_bed()
-
to_bigwig
(file_name, subset=None, **kwargs)¶ Export regions as BigWig file.
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_bigwig()
-
to_gff
(file_name, subset=None, **kwargs)¶ Export regions as GFF file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_gff()
-
class
-
class
fanc.pairs.
SamBamReadPairGenerator
(sam_file1, sam_file2, check_sorted=True)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from paired-end SAM/BAM files.
This
ReadPairGenerator
iterates over two SAM or BAM files that have been sorted by qname (for example withsamtools sort -n
).Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
filter_multi_mapping
(strict=True)¶ Convenience function that registers a UniquenessFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
- Parameters
strict – If True will filter if XS tag is present. If False, will filter only when XS tag is not 0. This is applied if alignments are from bowtie2.
-
filter_quality
(cutoff=30)¶ Convenience function that registers a QualityFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
- Parameters
cutoff – Minimum mapping quality (mapq) a read must have to pass the filter
-
filter_unmapped
()¶ Convenience function that registers an UnmappedFilter.
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
SelfLigationFilter
(mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filters read pairs where one or both reads map to the same restriction fragment.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if both reads are on the same fragment.
-
-
class
fanc.pairs.
TxtReadPairGenerator
(valid_pairs_file, sep=None, chr1_field=1, pos1_field=2, strand1_field=3, chr2_field=4, pos2_field=5, strand2_field=6)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from a plain text file.
This is an implementation of
ReadPairGenerator
for reading read pairs from an arbitrary text file. For specific text file formats have a look atFourDNucleomePairGenerator
orHicProPairGenerator
.TxtReadPairGenerator
iterates over lines in a file and splits them into fields at “sep”. It then extracts the chromosome, position and strand for each read in the pair, according to the fields specified by “chr<1|2>_field”, “pos<1|2>_field”, and “strand<1|2>_field”. If your file does not have strand fields, or if you don’t want to load them, you can simply set them to “None”.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
- Parameters
read_filter –
ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.- Returns
dict
-
-
class
fanc.pairs.
UniquenessFilter
(strict=True, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that do not map uniquely to the reference sequence.
-
valid_read
(read)¶ Check if a read has an XS tag.
If strict is enabled checks if a read has an XS tag. If not strict, XS has to be smaller than AS (alignment score) for a valid read.
-
-
class
fanc.pairs.
UnmappedFilter
(mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that do not map to the reference sequence.
-
valid_read
(read)¶ Check if the the flag bit 4 is set.
-
-
fanc.pairs.
generate_pairs
(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=10000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
- Parameters
sam1_file – Path to a sorted SAM/BAM file (1st mate)
sam2_file – Path to a sorted SAM/BAM file (2nd mate)
regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
read_filters – List of
ReadFilter
to filter reads while loading from SAM/BAMoutput_file – Path to output file
check_sorted – Double-check that input SAM files are sorted if True (default)
threads – Number of threads used for finding restriction fragments for read pairs
batch_size – Number of read pairs sent to each restriction fragment worker
- Returns
-
fanc.pairs.
generate_pairs_split
(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=1000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
- Parameters
sam1_file – Path to a sorted SAM/BAM file (1st mate)
sam2_file – Path to a sorted SAM/BAM file (2nd mate)
regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
read_filters – List of
ReadFilter
to filter reads while loading from SAM/BAMoutput_file – Path to output file
check_sorted – Double-check that input SAM files are sorted if True (default)
threads – Number of threads used for finding restriction fragments for read pairs
batch_size – Number of read pairs sent to each restriction fragment worker
- Returns