Introduction
GIA: Genome Interval Arithmetic
gia
is a free and open-source command-line tool for highly efficient
and scalable set operations on genomic interval data.
It is inspired by the open source command-line tools bedtools
and bedops
and aims to be a drop-in
replacement to both.
gia
is written in rust and distributed via cargo
.
It is a command-line tool built on top of bedrs
,
a separate and abstracted genomic interval library.
Issues and Contributions
gia
is a work-in-progress and under active development by Noam Teyssier.
If you are interested in building more functionality or want to get involved please don't hesitate to reach out!
Please address all issues to future contributors.
Installation
gia
is distributed using the rust package manager cargo
.
cargo install gia
You can validate the installation by checking gia
's help menu:
gia --help
Installing cargo
You can install cargo
by following the instructions here
Background
Set Theory on the Genome
Genomic Intervals
The core piece of genome interval arithmetic is the interval
object:
#![allow(unused)] fn main() { GenomicInterval { Chromosome, Start, End, } }
This is an abstract object with at minimum 3 attributes defining its chromosome, start, and end positions on the genome.
Genomic Interval Sets
A collection of genomic intervals can be considered a set which in brief is a collection of objects that match particular properties.
There is a branch of mathematics known as set theory which describe a range of operations, such as the union, intersection, difference, complement, etc of those sets.
Some examples of these are shown below:
Intersection
This generates all the intervals that are at the intersection of two interval sets.
(a) x---------y x-----------y
(b) i--j i--------j i--------j
========================================
i--j i-y x-j i----y
Difference
This generates all the intervals in a
which are not covered by b
.
This calculates the difference / relative complement
of a set.
(a) x----------y x------------y
(b) i--j i--------j i--------j
========================================
x--i j--i j----i
Complement
This generates all the intervals in a
which are not covered by its span (s
).
This is the absolute complement
of the set.
(s) s----------------------------s
========================================
(a) x-----y x------y x------y
========================================
y---x y----x
Genomic Interval Arithmetic
Genomic interval arithmetic revolves around performing set theoretical operations in the context of genomic regions, and is useful for a wide range of purposes in bioinformatics analyses.
Purpose - i.e. why gia?
gia
was developed to split the difference between bedtools
and bedops
and be a tool that can
match both philosophies without sacrificing either efficiency or convenience.
That being said, the author of gia
was greatly inspired by both tools and has
used them extensively for years.
gia
would not exist if not for the work of their authors and maintainers as well
as their meticulous documentation.
Philosophies
bedtools
- utility over efficiency
bedtools
is the original
genome interval toolkit and is the go-to for many people.
It prioritizes convenience and utility over computational efficiency, and does that very well.
One of the major design choices for most of the tools in the toolkit is that the genome interval sets are loaded into memory completely before processing occurs.
This incurs a huge memory and computational overhead once genome interval sets get larger and larger - which is increasingly the case for large high throughput genomic datasets.
bedops
- efficiency over utility
bedops
came later from bedtools
and was built for computational efficiency.
Most of the methods within focus around pre-sorted data, and the computational and memory efficiency comes from the fact that everything is built around streams (i.e. intervals are assumed sorted and only kept in memory for the abosolute minimum amount of time required for the operation.)
This leads to highly efficient streaming operations with a constant memory overhead, but provides some inconveniences, as all inputs must be presorted, and some functional limitations, as most of the set operations implicitly merge intervals on input.
gia
- both in a single tool
gia
was built with the idea that both philosophies are useful for different
purposes and that the same operations and underlying implementations can be
shared.
By default, all tools are built with an inplace memory load, which allows
for the complete set of functionality available in bedtools
with no expectation
that the dataset is a priori sorted or merged, but where relevant an argument
may be passed to allow for streaming operations, which perform highly performant
memory constant operations on pre-sorted inputs such as in bedops
.
[ gia closest ]
Background
Similar to intersect, closest
searches for the closest
feature in B
for each feature in A
.
The closest
feature is not necessarily non-overlapping and under certain
search constraints may not exist.
Note: Chromosomal Distance
There is no notion of interchromosomal distance, so if an interval is alone on an interval in
A
, then there will be no closest interval inB
.
Usage
See full arguments and options using:
gia closest --help
Default Behavior
(a) x----------y
(b) i-----j i-------j
=========================================
i-----j
By default closest
will return the single closest, potentially overlapping, feature in B
for each interval in A
.
Ties will be given to the left-most interval with the closest start coordinate to the query.
gia closest -a <input.bed> -b <input.bed>
Closest Upstream
(a) x----------y
(b) i-----j i-------j
=========================================
i-----j
You can explicitly exclude all downstream intervals from the search, regardless of their distance, with the upstream flag.
gia closest -a <input.bed> -b <input.bed> -u
Closest Downstream
(a) x----------y
(b) i-----j i-------j
=========================================
i-------j
You can explicitly exclude all upstream intervals from the search, regardless of their distance, with the downstream flag.
gia closest -a <input.bed> -b <input.bed> -d
[ gia complement ]
Background
Similar to subtract, complement
generates all the intervals
that are not covered by the input set.
This is equivalent to subtracting the interval set from its span, which is an interval defined by the min and max interval of the set by chromosome.
Usage
See full arguments and options using:
gia complement --help
Default Behavior
(span) s---------------------------------s
==========================================================
(input) x------y x-----y x---------y
==========================================================
(complement) y-----x y----x
By default complement
will return all intervals that are uncovered by the
span of the incoming interval set.
The span of the interval set is calculated by chromosome.
gia complement -i <input.bed>
Note: Internal vs Complete Complement
The internal complement of a set necessarily excludes the the chromosomal start to the span start and the span end to the chromosomal end.
All other potential intervals are included.
[ gia extend ]
Background
This subcommand is used to grow intervals from either of their terminals.
It can be used to either grow both sides, just the left, or just the right.
An optional -g
flag can be given if the chromosome length is known, which will
be used to truncate growths to not overextend the chromosome length.
Usage
See full arguments and options using:
gia extend --help
Extend Left
(input) x---------y x------y
==============================================
(output) x---------------y
x-------------y
This grows the intervals to the left but truncates to zero to avoid negatives.
gia extend -i <input.bed> -l 20
Extend Right
(input) x---------y x------y
==============================================
(output) x---------------y
x-------------y
This grows the intervals to the right but truncates to chromosome max
if a .genome
file is provided.
gia extend -i <input.bed> -r 20
Extend Both (Equal)
(input) x---------y x------y
==============================================
(output) x-------------------y
x---------------y
This grows the intervals to the left and right by an equal amount.
This will truncate to zero if a potential negative is encountered
and will truncate to the chromozome max if a .genome
file is provided.
gia extend -i <input.bed> -b 20
Extend Both (Unequal)
(input) x---------y x------y
==============================================
(output) x---------------y
x-----------y
This grows the intervals to the left and right by an unequal amount.
The growth to the left is controlled by the -l
flag and the growth
to the right is controlled by the -r
flag separately.
This will truncate to zero if a potential negative is encountered
and will truncate to the chromozome max if a .genome
file is provided.
gia extend -i <input.bed> -l 20 -r 5
[ gia get-fasta ]
Background
This subcommand is used to extract sequences from intervals provided in an input BED file from an indexed fasta.
The fasta is assumed indexed using samtools faidx
and assumes the index is </path/to.fasta>.fai
.
Usage
See full arguments and options using:
gia get-fasta --help
Extract Sequences
If the chromosome names are integers we can extract the sequences using the following command:
gia get-fasta -b <input.bed> -f <input.fa>
Input Integer BED
1 20 30
2 30 40
Input Integer Fasta
>1
ACCCCTATCTATCACACTTCAGCGACTA
CGACTACGACCATCGACGATCAGCATCA
GCATCGACTACGACGATCAGCGACTACG
AGCTACGACGAGCG
>2
GGTAGTTAGTAGAGTTAGACTACGATCG
ATCGATCGATCGAGCGGCGCGCATCGAT
CGTAGCCGCGGCGTACGTAGCGCAGCAG
TCGTAGCTACGTAG
Output Integer Fasta
>1:20-30
AGCGACTACG
>2:30-40
CGATCGATCG
Extract Sequences with non-integer named bed and chromosome names
If the chromosome names are non-integers, gia
can handle the conversion
and no extra flags are required.
gia get-fasta -b <input.bed> -f <input.fa>
Input Non-Integer BED
chr1 20 30
chr2 30 40
Input Non-Integer Fasta
>chr1
ACCCCTATCTATCACACTTCAGCGACTA
CGACTACGACCATCGACGATCAGCATCA
GCATCGACTACGACGATCAGCGACTACG
AGCTACGACGAGCG
>chr2
GGTAGTTAGTAGAGTTAGACTACGATCG
ATCGATCGATCGAGCGGCGCGCATCGAT
CGTAGCCGCGGCGTACGTAGCGCAGCAG
TCGTAGCTACGTAG
Output Non-Integer Fasta
>chr1:20-30
AGCGACTACG
>chr2:30-40
CGATCGATCG
[ gia intersect ]
Background
This subcommand calculates the genomic regions found at the intersection of the first input (query) and the second input (target).
(query) x-------------y x----------y x------y
(target) i------j i------j i-----------j
=================================================================
i------j x---j x------y
Usage
See full arguments and options using:
gia intersect --help
Default Behavior
By default the intervals in the query and target are not assumed sorted or merged and intersections are performed inplace.
gia intersect -a <query.bed> -b <target.bed>
Streamed Intervals
If the intervals in the query and target are presorted then we can operate on them as a stream - which keeps the memory usage constant and leads to speedups at very high number of intervals.
gia intersect -a <query.bed> -b <target.bed> -S
Fractional Overlaps
We can also define conditional operations on fractional overlaps of the query, target, or both.
This means that the intersection will only be done on query-target pairs which meet the fractional overlap predicate provided.
On Query
x-------------y x---------------y
i---------j i--j
======================================================
(-f 0.5)
(ix) x---------j
We can supply a minimum overlap requirement on the query with the -f
flag.
This will only apply intersection operations on query-target pairs in which the
target overlaps the query by the amount required in the -f
argument.
In the example case, only the first query-target pair was operated upon since the second did not overlap the query by 50%.
gia intersect -a <query.bed> -b <target.bed> -f 0.5
On Target
x-------------y x---------------y
i---------j i----------j
======================================================
(-F 0.5)
(ix) i-------y
We can supply a minimum overlap requirement on the target with the -F
flag.
This will only apply intersection operations on query-target pairs in which the
query overlaps the target by the amount required in the -F
argument.
In the example case, only the first query-target pair was operated upon since the second did not overlap the target by 50%.
gia intersect -a <query.bed> -b <target.bed> -F 0.5
Reciprocal
We can introduce a reciprocal argument (-r
) which requires the -f
flag
and requires that both query and target meet the same requirements of the flag.
gia intersect -a <query.bed> -b <target.bed> -f 0.5 -r
Either
We can introduce the either flag (-e
) which is used with both the -f
and -F
flags.
This will require that either condition is met and include those subtraction operations.
gia intersect -a <query.bed> -b <target.bed> -f 0.5 -F 0.3 -e
Reporting Query, Target, or Inverse
In the cases where we're interested in specifically what intervals are overlapping and not necessarily their specific intersections we can report either the query or target interval for each potential intersection event.
# Reports all query intervals that intersect the target intervals
gia intersect -a <query.bed> -b <target.bed> -q
# Reports all query intervals that intersect the target intervals (only once)
gia intersect -a <query.bed> -b <target.bed> -q -u
# Reports all target intervals that intersect the query intervals
gia intersect -a <query.bed> -b <target.bed> -t
We can also report all query intervals that do not intersect the target intervals.
This is an homage to grep -v
# Reports all query intervals that do not intersect the target intervals
gia intersect -a <query.bed> -b <target.bed> -v
[ gia merge ]
Background
This subcommand will merge all overlapping and bordering intervals within the input set.
It can accept either a presorted input (which can be streamed) or an unsorted input which will first be sorted in-place before merging.
Usage
See full arguments and options using:
gia merge --help
Default Behavior (Inplace)
(e) q----r
(b) k----l
(a) i----j
(c) l----m
(d) o----p
===============================
(1) i-----------m
(2) o------r
This will merge all overlapping and bordering intervals into their sub-spans.
It does not assume presorted input and will sort the input inplace on load.
gia merge -i <input.bed>
Streamable Input
(a) i----j
(b) k----l
(c) l----m
(d) o----p
(e) q----r
===============================
(1) i-----------m
(2) o------r
This will merge all overlapping and bordering intervals into their sub-spans.
This assumes presorted input and will have undefined behavior is input is not-sorted.
gia merge -i <input.bed> -S
[ gia random ]
Background
This subcommand will generate a random set of intervals.
Usage
See full arguments and options using:
gia random --help
Parameterization
This generates intervals given:
- Number of Intervals
- Length of Intervals
- Number of Chromosomes
- Max Chromosome Length
The intervals are drawn randomly from a random chromosome and are each equally sized to the length provided.
An optional *.genome
file can be given to provide multiple and potentially
differently sized chromosome lengths.
# will randomly generate intervals given default params
gia random
# sets the number of chromosomes to 10
gia random -c 10
# generates 100 random intervals
gia random -n 100
# generates 100 random intervals each with a length of 500
gia random -n 100 -l 500
# generates 100 random intervals with a length of 500 using a prior genome
gia random -n 100 -l 500 -g <my.genome>
[ gia sample ]
Background
This subcommand will randomly sample intervals from an input file.
Note: The output of this command is not guaranteed to be sorted.
Usage
See full arguments and options using:
gia sample --help
Sample a fixed number of intervals
You can sample a fixed number of intervals from the dataset using the -n
flag
# samples 100 intervals from the dataset
gia sample -i <input.bed> -n 100
Sample a fractional number of intervals
You can sample a fractional number of intervals from the total size of the dataset
using the -f
flag.
# samples 0.1 (10%) of the intervals
gia sample -i <input.bed> -f 0.1
[ gia sort ]
Background
This subcommand will sort intervals from an input file.
Note: Chromosome Ordering
If named chromosomes are provided they will be sorted by the order in which they appear in the input file. In the future this will likely be lexicographically ordered.
Usage
See full arguments and options using:
gia sort --help
Sort an input BED
You can sort an input file with the following command:
gia sort -i <input.bed>
Sort a named input BED
If the chromosome names are non-integers you can run the following command:
gia sort -i <input.bed> -N
[ gia subtract ]
Background
This subcommand calculates the genomic regions found within the first input (query) while excluding regions in the second (target).
(query) x-------------y x----------y x------y
(target) i------j i------j i-----------j
=================================================================
x---i j--y i--x j------y i-x y--j
Usage
See full arguments and options using:
gia subtract --help
Default Behavior (Merging)
By default the intervals in the query and target are merged before subtraction to calculate the difference across the subspans.
(query) x-------y
x-------y
x----------y
(target) i----j
i------j
=============================================
(merged_q) x------------------y
(merged_t) i-----------j
=============================================
(subtraction) x----i j-y
gia subtract -a <query.bed> -b <target.bed>
Unmerged Subtraction
If you would like to keep the intervals separate on merging then
you can supply the -u
flag to keep the intervals separate.
(query) x---------y
x-------y
x----------y
(target) i---j
i-------j
=============================================
(subtraction) x----i
j-y
j-y
This is useful in cases where the intervals are separate classes and it doesn't make sense biologically to merge their interval spans.
gia subtract -a <query.bed> -b <target.bed> -u
Fractional Overlap
We can also define conditional subtraction operations on fractional overlaps of the query, target, or both.
This means that the subtraction will only be done on query-target pairs which meet the fractional overlap predicate provided.
On Query
(query) x------------------------y x------y
(target) i--j i---j
======================================================
(-f 0.5)
(sub) x------------------------y x--i
We can supply a minimum overlap requirement on the query with the -f
flag.
This will only apply subtraction operations on query-target pairs in which the
target overlaps the query by the amount required in the -f
argument.
In the example case, only the second query-target pair was operated upon since the first did not overlap the query by 50%.
gia subtract -a <query.bed> -b <target.bed> -f 0.5
On Target
(query) x------------------------y x------y
(target) i--j i--------------j
======================================================
(-F 0.5)
(sub) x----------i j----------y x------y
We can supply a minimum overlap requirement on the target with the -F
flag.
This will only apply subtraction operations on query-target pairs in which the
query overlaps the target by the amount required in the -F
argument.
In the example case, only the first query-target pair was operated upon since the second did not overlap the target by 50%.
gia subtract -a <query.bed> -b <target.bed> -f 0.5
Reciprocal
We can introduce a reciprocal argument (-r
) which requires the -f
flag
and requires that both query and target meet the same requirements of the flag.
gia subtract -a <query.bed> -b <target.bed> -f 0.5 -r
Either
We can introduce the either flag (-e
) which is used with both the -f
and -F
flags.
This will require that either condition is met and include those subtraction operations.
gia subtract -a <query.bed> -b <target.bed> -f 0.5 -F 0.3 -e