Regions module¶
Module for handling genomic regions such as chromosomes, bins, and restriction fragments.
The class RegionsTable
is an implementation of the RegionBased
interface from the genomic_regions
package. More details on how to use the
RegionBased
interface can be found in the genomic_regions
documentation, but here is an example to get you started:
import fanc
rt = fanc.RegionsTable()
# demo how to add regions
regions = []
for chromosome in ['chr1', 'chr2']:
for start in range(1, 10000, 1000):
r = fanc.GenomicRegion(chromosome=chromosome, start=start, end=start+999)
regions.append(r)
rt.add_regions(regions)
# query regions on chromosome 1
for r in rt.regions('chr1'):
print(r) # chr1:1-1000, chr1:1001-2000, ..., chr1:9001-10000
The class Chromosome
holds chromosome information, specifically the
chromosome name in the reference genome, its length in base pairs, and its DNA
sequence. Multiple Chromosome
objects can be grouped into a Genome
,
which provides convenient access to its chromosomes and has useful functions for
in silico digestion and genome binning.
import fanc
# create chromosomes
chromosome1 = fanc.Chromosome(name='chr1', sequence='AAGTCCGTGCTGTCGATCATAGCTAGCTAGCTA')
chromosome2 = fanc.Chromosome(name='chr2', sequence='GTGTCGATCAAATCGAAA')
len(chromosome1) # 33
# create genome
genome = fanc.Genome()
genome.add_chromosome(chromosome1)
genome.add_chromosome(chromosome2)
# in silico digestion
restriction_fragments = genome.get_regions('MboI') # cuts 'GATC'
for r in restriction_fragments.regions:
print(r) # chr1:1-15, chr1:16-33, chr2:1-6, chr2:7-18
# binning
bins = genome.get_regions(10) # cuts 'GATC'
for r in bins.regions:
print(r) # chr1:1-10, chr1:11-20, chr1:21-30, chr1:31-33, chr2:1-10, chr2:11-18
# load genome from file
genome_from_file = Genome.from_string("hg19_chr18_19.fa")
genome_from_file.chromosomes() # ['chr18', 'chr19']
-
class
fanc.regions.
Chromosome
(name=None, length=None, sequence=None)¶ Bases:
object
Chromosome data type.
-
name
¶ Name of the chromosome
-
length
¶ Length of the chromosome in base-pairs
-
sequence
¶ Base-pair sequence of DNA in the chromosome
-
classmethod
from_fasta
(file_name, name=None, include_sequence=True)¶ Create a
Chromosome
from a FASTA file.This class method will load a FASTA file and convert it into a
Chromosome
object. If the FASTA file contains multiple sequences, the output will be a list ofChromosome
objects.- Parameters
file_name – Path to the FASTA file
name – Chromosome name. If None (default), will be read from the FASTA file header
include_sequence – If True (default), stores the chromosome sequence in memory. Else, the sequence attribute will be set to None.
- Returns
Chromosome
if there is only a single FASTA sequence in the file, list(Chromosome
) if there are multiple sequences.
-
get_restriction_sites
(restriction_enzyme)¶ Find the restriction sites of a provided enzyme in this chromosome.
Internally uses Biopython to find RE sites.
- Parameters
restriction_enzyme – The name of the restriction enzyme (e.g. HindIII)
- Returns
List of RE sites in base-pairs (1-based)
-
-
class
fanc.regions.
Genome
(file_name=None, chromosomes=None, mode='a', tmpdir=None, _table_name_chromosomes='chromosomes')¶ Bases:
fanc.general.FileGroup
Class representing a collection of chromosomes.
Provides some convenience batch methods that call
Chromosome
methods for every chromosome in this object.-
class
ChromosomeDefinition
¶ Bases:
tables.description.IsDescription
-
add_chromosome
(chromosome)¶ Add a
Chromosome
to this object.Will choose suitable defaults for missing attributes.
- Parameters
chromosome –
Chromosome
object or similar object (e.g. dict) with the same fields
-
chromosomes
()¶ Get list of chromosomes in this object
-
close
(copy_tmp=True, remove_tmp=True)¶ Close this HDF5 file and run exit operations.
If file was opened with tmpdir in read-only mode: close file and delete temporary copy.
If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.
- Parameters
copy_tmp – If False, does not overwrite original with modified file.
remove_tmp – If False, does not delete temporary copy of file.
-
classmethod
from_folder
(folder_name, file_name=None, exclude=None, include_sequence=True, tmpdir=None)¶ Load every FASTA file from a folder as a chromosome.
- Parameters
folder_name – Path to the folder to load
file_name – File to save Genome object to
exclude – List or set of chromosome names that should NOT be loaded
include_sequence – If True, will save the chromosome sequences in the Genome object
-
classmethod
from_string
(genome_string, file_name=None, tmpdir=None, mode='a')¶ Convenience function to load a
Genome
from a string.- Parameters
genome_string – Path to FASTA file, path to folder with FASTA files, comma-separated list of paths to FASTA files, path to HDF5 file
file_name – Path to save file
- Returns
A
Genome
object
-
get_regions
(split, file_name=None, chromosomes=None)¶ Extract genomic regions from genome.
Provides two options:
Splits chromosomes at restriction sites if the split parameter is the name of a restriction enzyme.
Splits chromosomes at equi-distant points if split is an integer
- Parameters
split – Name of a restriction enzyme or positive integer
file_name – Name of a file if the result of this method should be saved to file
chromosomes – List of chromosome names to include. Default: all
- Returns
GenomicRegions
-
sub_sequence
(chromosome, start=None, end=None)¶ Extract the chromosome DNA sequence between start and end.
- Parameters
chromosome – Name of chromosome
start – start position in bp (1-based, inclusive)
end – end position in bp (1-based, inclusive)
- Returns
str
-
class
-
class
fanc.regions.
RegionsTable
(file_name=None, mode='a', tmpdir=None, additional_fields=None, _table_name_regions='regions')¶ Bases:
fanc.regions.RegionBasedWithBins
,fanc.general.FileGroup
PyTables Table wrapper for storing genomic regions.
This class is inherited by objects working with lists of genomic regions, such as equidistant bins along chromosomes in a genome (
Hic
) or restriction fragments of genomic DNA (ReadPairs
)Internally, each genomic region is encoded in a PyTables Table and the following region attributes are represented as table columns: ix, chromosome, start, end, and strand. To add additional region attributes, such as a score, use the “additional_fields” parameter of the __init__ method. This must be a dict where the keys are str and values are PyTables column descriptors, such as
StringCol
. Example for adding a score field:import fanc import tables rt = fanc.RegionsTable( additional_fields={'score': tables.Float32Col()} )
-
class
ChromosomeDescription
¶ Bases:
tables.description.IsDescription
Description of the chromosomes in this object.
-
class
RegionDescription
¶ Bases:
tables.description.IsDescription
Description of a genomic region for PyTables Table
-
add_region
(region, *args, **kwargs)¶ Add a genomic region to this object.
This method offers some flexibility in the types of objects that can be loaded. See parameters for details.
- Parameters
region – Can be a
GenomicRegion
, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
-
add_regions
(regions, *args, **kwargs)¶ Bulk insert multiple genomic regions.
- Parameters
regions – List (or any iterator) with objects that describe a genomic region. See
add_region
for options.
-
static
bin_intervals
(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into a fixed number of bins.
- Parameters
intervals – iterator of tuples (start, end, score)
bins – Number of bins to divide the region into
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
-
static
bin_intervals_equidistant
(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into bins with a fixed size.
- Parameters
intervals – iterator of tuples (start, end, score)
bin_size – Size of each bin in base pairs
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
- Returns
iterator of tuples: (start, end, score)
-
property
bin_size
¶ Return the length of the first region in the dataset.
Assumes all bins have equal size.
- Returns
int
-
binned_regions
(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Same as region_intervals, but returns
GenomicRegion
objects instead of tuples.- Parameters
region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of
GenomicRegion
objects
-
bins_to_distance
(bins)¶ Convert fraction of bins to base pairs
- Parameters
bins – float, fraction of bins
- Returns
int, base pairs
-
property
chromosome_bins
¶ Returns a dictionary of chromosomes and the start and end index of the bins they cover.
Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.
-
property
chromosome_lengths
¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes
()¶ List all chromosomes in this regions table. :return: list of chromosome names.
-
close
(copy_tmp=True, remove_tmp=True)¶ Close this HDF5 file and run exit operations.
If file was opened with tmpdir in read-only mode: close file and delete temporary copy.
If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.
- Parameters
copy_tmp – If False, does not overwrite original with modified file.
remove_tmp – If False, does not delete temporary copy of file.
-
distance_to_bins
(distance)¶ Convert base pairs to fraction of bins.
- Parameters
distance – distance in base pairs
- Returns
float, distance as fraction of bin size
-
find_region
(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶ Find the region that is at the center of a region.
- Parameters
query_regions – Region selector string, :class:~GenomicRegion, or list of the former
- Returns
index (or list of indexes) of the region at the center of the query region
-
flush
()¶ Write buffered data to file.
-
intervals
(*args, **kwargs)¶ Alias for region_intervals.
-
region_bins
(*args, **kwargs)¶ Return slice of start and end indices spanned by a region.
- Parameters
args – provide a
GenomicRegion
here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.- Returns
-
region_data
(key, value=None)¶ Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced
- Parameters
key – Name of the data column
value – vector with region-based data (one entry per region)
-
region_intervals
(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶ Return equally-sized genomic intervals and associated scores.
Use either bins or bin_size argument to control binning.
- Parameters
region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals
- Returns
iterator of tuples: (start, end, score)
-
region_subset
(region, *args, **kwargs)¶ Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.
- Parameters
region – String or class:~GenomicRegion object for which covered bins will be returned.
-
property
regions
¶ Iterate over genomic regions in this object.
Will return a
GenomicRegion
object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.- Returns
RegionIter
-
property
regions_dict
¶ Return a dictionary with region index as keys and regions as values.
- Returns
dict {region.ix: region, …}
-
to_bed
(file_name, subset=None, **kwargs)¶ Export regions as BED file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_bed()
-
to_bigwig
(file_name, subset=None, **kwargs)¶ Export regions as BigWig file.
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_bigwig()
-
to_gff
(file_name, subset=None, **kwargs)¶ Export regions as GFF file
- Parameters
file_name – Path of file to write regions to
subset – optional
GenomicRegion
or str to write only regions overlapping this regionkwargs – Passed to
write_gff()
-
class
-
class
fanc.regions.
LazyGenomicRegion
(row, ix=None, auto_update=True)¶ Bases:
genomic_regions.regions.GenomicRegion
A
GenomicRegion
object with lazy attribute loading.This class is central to an efficient retrieval of regions from objects subclassing
RegionsTable
. Its handling should be mostly identical toGenomicRegion
, but attributes will only be loaded on demand. Changes to attributes will change the underlying row in the HDF5 regions table if the auto_update parameter is set to True (default). Else you can manually callupdate()
.-
as_bed_line
(score_field='score', name_field='name')¶ Return a representation of this object as line in a BED file.
- Parameters
score_field – name of the attribute to be used in the ‘score’ field of the BED line
name_field – name of the attribute to be used in the ‘name’ field of the BED line
- Returns
str
-
as_gff_line
(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶ Return a representation of this object as line in a GFF file.
- Parameters
source_field – name of the attribute to be used in the ‘source’ field of the GFF line
feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
score_field – name of the attribute to be used in the ‘score’ field of the GFF line
frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
float_format – Formatting string for the float fields
- Returns
str
-
property
attributes
¶ List of all attributes in this region.
-
property
center
¶ Return the center coordinate of the
GenomicRegion
.- Returns
float
-
contains
(region)¶ Check if the specified region is completely contained in this region.
- Parameters
region –
GenomicRegion
object or string
-
copy
()¶ Return a (shallow) copy of this
GenomicRegion
- Returns
GenomicRegion
-
expand
(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶ Expand this region by a relative or an absolute amount.
- Parameters
absolute – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>relative – Relative amount as fraction of region by which to expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>absolute_left – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the left sideabsolute_right – Absolute amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the right siderelative_left – Relative amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the left siderelative_right – Relative amount in base pairs by which to expand the region represented by this
GenomicRegion
object on the right sidecopy – If True, return a copy of the original region, if False will modify the existing region in place
from_center – If True measures distance from center rather than start and end of the old region
- Returns
GenomicRegion
-
property
five_prime
¶ Return the position of the 5’ end of this
GenomicRegion
on the reference.- Returns
int
-
fix_chromosome
(copy=False)¶ Change chromosome representation from chr<NN> to <NN> or vice versa.
- Parameters
copy – If True, make copy of region, otherwise will modify existing region in place.
- Returns
GenomicRegion
-
classmethod
from_string
(region_string)¶ Convert a string into a
GenomicRegion
.This is a very useful convenience function to quickly define a
GenomicRegion
object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.- Parameters
region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
- Returns
GenomicRegion
-
is_forward
()¶ Return True if this region is on the forward strand of the reference genome.
- Returns
True if on ‘+’ strand, False otherwise.
-
is_reverse
()¶ Return True if this region is on the reverse strand of the reference genome.
- Returns
True if on ‘-‘ strand, False otherwise.
-
property
ix
¶ Region index.
Location in underlying list of regions.
-
overlap
(region)¶ Return the overlap in base pairs between this region and another region.
- Parameters
region –
GenomicRegion
to find overlap for- Returns
overlap as int in base pairs
-
overlaps
(region)¶ Check if this region overlaps with the specified region.
- Parameters
region –
GenomicRegion
object or string
-
set_attribute
(attribute, value)¶ Safely set an attribute on the
GenomicRegion
object.This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.
- Parameters
attribute – Name of the attribute to be set
value – Value of the attribute to be set
-
property
strand
¶ Strand this region is located on as int.
- Returns
1: forward strand, -1 reverse strand, 0 or None: unknown.
-
property
strand_string
¶ Return the ‘strand’ attribute as string.
- Returns
strand as str (‘+’, ‘-‘, or ‘.’)
-
property
three_prime
¶ Return the position of the 3’ end of this
GenomicRegion
on the reference.- Returns
int
-
to_string
()¶ Convert this
GenomicRegion
to its string representation.- Returns
str
-
-
fanc.regions.
genome_regions
(re_or_file, restriction_enzyme=None)¶ Obtain RE fragments or bins of equal size from a reference genome.
- Parameters
re_or_file – Path to or
RegionBased
file with regions corresponding to restriction fragments, or path to FASTA file with chromosomesrestriction_enzyme – If the first argument is a FASTA file, you must provide the name of a restriction enzyme here for in silico genome digestion. You can also provide an integer, in which case the genome will be binned into regions of this size in bp.
- Returns
RegionsTable
of restriction fragments