Introduction

GIA: Genome Interval Arithmetic

gia is a free and open-source command-line tool for highly efficient and scalable set operations on genomic interval data.

It is inspired by the open source command-line tools bedtools and bedops and aims to be a drop-in replacement to both.

gia is written in rust and distributed via cargo. It is a command-line tool built on top of bedrs, a separate and abstracted genomic interval library.

Issues and Contributions

gia is a work-in-progress and under active development by Noam Teyssier.

If you are interested in building more functionality or want to get involved please don't hesitate to reach out!

Please address all issues to future contributors.

Installation

gia is distributed using the rust package manager cargo.

cargo install gia

You can validate the installation by checking gia's help menu:

gia --help

Installing cargo

You can install cargo by following the instructions here

Background

Set Theory on the Genome

Genomic Intervals

The core piece of genome interval arithmetic is the interval object:


#![allow(unused)]
fn main() {
GenomicInterval {
    Chromosome,
    Start,
    End,
}
}

This is an abstract object with at minimum 3 attributes defining its chromosome, start, and end positions on the genome.

Genomic Interval Sets

A collection of genomic intervals can be considered a set which in brief is a collection of objects that match particular properties.

There is a branch of mathematics known as set theory which describe a range of operations, such as the union, intersection, difference, complement, etc of those sets.

Some examples of these are shown below:

Intersection

This generates all the intervals that are at the intersection of two interval sets.

(a)   x---------y    x-----------y
(b)     i--j  i--------j    i--------j
========================================
        i--j  i-y    x-j    i----y

Difference

This generates all the intervals in a which are not covered by b. This calculates the difference / relative complement of a set.

(a)  x----------y   x------------y
(b)     i--j  i--------j    i--------j
========================================
     x--i  j--i        j----i    

Complement

This generates all the intervals in a which are not covered by its span (s). This is the absolute complement of the set.

(s)  s----------------------------s
========================================
(a)  x-----y   x------y    x------y
========================================
           y---x      y----x    

Genomic Interval Arithmetic

Genomic interval arithmetic revolves around performing set theoretical operations in the context of genomic regions, and is useful for a wide range of purposes in bioinformatics analyses.

Purpose - i.e. why gia?

gia was developed to split the difference between bedtools and bedops and be a tool that can match both philosophies without sacrificing either efficiency or convenience.

That being said, the author of gia was greatly inspired by both tools and has used them extensively for years.

gia would not exist if not for the work of their authors and maintainers as well as their meticulous documentation.

Philosophies

bedtools - utility over efficiency

bedtools is the original genome interval toolkit and is the go-to for many people.

It prioritizes convenience and utility over computational efficiency, and does that very well.

One of the major design choices for most of the tools in the toolkit is that the genome interval sets are loaded into memory completely before processing occurs.

This incurs a huge memory and computational overhead once genome interval sets get larger and larger - which is increasingly the case for large high throughput genomic datasets.

bedops - efficiency over utility

bedops came later from bedtools and was built for computational efficiency.

Most of the methods within focus around pre-sorted data, and the computational and memory efficiency comes from the fact that everything is built around streams (i.e. intervals are assumed sorted and only kept in memory for the abosolute minimum amount of time required for the operation.)

This leads to highly efficient streaming operations with a constant memory overhead, but provides some inconveniences, as all inputs must be presorted, and some functional limitations, as most of the set operations implicitly merge intervals on input.

gia - both in a single tool

gia was built with the idea that both philosophies are useful for different purposes and that the same operations and underlying implementations can be shared.

By default, all tools are built with an inplace memory load, which allows for the complete set of functionality available in bedtools with no expectation that the dataset is a priori sorted or merged, but where relevant an argument may be passed to allow for streaming operations, which perform highly performant memory constant operations on pre-sorted inputs such as in bedops.

[ gia closest ]

Background

Similar to intersect, closest searches for the closest feature in B for each feature in A.

The closest feature is not necessarily non-overlapping and under certain search constraints may not exist.

Note: Chromosomal Distance

There is no notion of interchromosomal distance, so if an interval is alone on an interval in A, then there will be no closest interval in B.

Usage

See full arguments and options using:

gia closest --help

Default Behavior

(a)           x----------y
(b)   i-----j                 i-------j
=========================================
      i-----j

By default closest will return the single closest, potentially overlapping, feature in B for each interval in A.

Ties will be given to the left-most interval with the closest start coordinate to the query.

gia closest -a <input.bed> -b <input.bed>

Closest Upstream

(a)               x----------y
(b)   i-----j                 i-------j
=========================================
      i-----j

You can explicitly exclude all downstream intervals from the search, regardless of their distance, with the upstream flag.

gia closest -a <input.bed> -b <input.bed> -u

Closest Downstream

(a)          x----------y
(b)   i-----j                 i-------j
=========================================
                              i-------j

You can explicitly exclude all upstream intervals from the search, regardless of their distance, with the downstream flag.

gia closest -a <input.bed> -b <input.bed> -d

[ gia complement ]

Background

Similar to subtract, complement generates all the intervals that are not covered by the input set.

This is equivalent to subtracting the interval set from its span, which is an interval defined by the min and max interval of the set by chromosome.

Usage

See full arguments and options using:

gia complement --help

Default Behavior

(span)        s---------------------------------s
==========================================================
(input)       x------y     x-----y    x---------y
==========================================================
(complement)         y-----x     y----x

By default complement will return all intervals that are uncovered by the span of the incoming interval set.

The span of the interval set is calculated by chromosome.

gia complement -i <input.bed>

Note: Internal vs Complete Complement

The internal complement of a set necessarily excludes the the chromosomal start to the span start and the span end to the chromosomal end.

All other potential intervals are included.

[ gia extend ]

Background

This subcommand is used to grow intervals from either of their terminals. It can be used to either grow both sides, just the left, or just the right. An optional -g flag can be given if the chromosome length is known, which will be used to truncate growths to not overextend the chromosome length.

Usage

See full arguments and options using:

gia extend --help

Extend Left

(input)          x---------y    x------y
==============================================
(output)   x---------------y
                         x-------------y

This grows the intervals to the left but truncates to zero to avoid negatives.

gia extend -i <input.bed> -l 20

Extend Right

(input)          x---------y    x------y
==============================================
(output)         x---------------y
                                x-------------y

This grows the intervals to the right but truncates to chromosome max if a .genome file is provided.

gia extend -i <input.bed> -r 20

Extend Both (Equal)

(input)          x---------y    x------y
==============================================
(output)    x-------------------y
                            x---------------y

This grows the intervals to the left and right by an equal amount. This will truncate to zero if a potential negative is encountered and will truncate to the chromozome max if a .genome file is provided.

gia extend -i <input.bed> -b 20

Extend Both (Unequal)

(input)          x---------y    x------y
==============================================
(output)    x---------------y
                            x-----------y

This grows the intervals to the left and right by an unequal amount. The growth to the left is controlled by the -l flag and the growth to the right is controlled by the -r flag separately.

This will truncate to zero if a potential negative is encountered and will truncate to the chromozome max if a .genome file is provided.

gia extend -i <input.bed> -l 20 -r 5

[ gia get-fasta ]

Background

This subcommand is used to extract sequences from intervals provided in an input BED file from an indexed fasta.

The fasta is assumed indexed using samtools faidx and assumes the index is </path/to.fasta>.fai.

Usage

See full arguments and options using:

gia get-fasta --help

Extract Sequences

If the chromosome names are integers we can extract the sequences using the following command:

gia get-fasta -b <input.bed> -f <input.fa>

Input Integer BED

1	20	30
2	30	40

Input Integer Fasta

>1
ACCCCTATCTATCACACTTCAGCGACTA
CGACTACGACCATCGACGATCAGCATCA
GCATCGACTACGACGATCAGCGACTACG
AGCTACGACGAGCG
>2
GGTAGTTAGTAGAGTTAGACTACGATCG
ATCGATCGATCGAGCGGCGCGCATCGAT
CGTAGCCGCGGCGTACGTAGCGCAGCAG
TCGTAGCTACGTAG

Output Integer Fasta

>1:20-30
AGCGACTACG
>2:30-40
CGATCGATCG

Extract Sequences with non-integer named bed and chromosome names

If the chromosome names are non-integers, gia can handle the conversion and no extra flags are required.

gia get-fasta -b <input.bed> -f <input.fa>

Input Non-Integer BED

chr1	20	30
chr2	30	40

Input Non-Integer Fasta

>chr1
ACCCCTATCTATCACACTTCAGCGACTA
CGACTACGACCATCGACGATCAGCATCA
GCATCGACTACGACGATCAGCGACTACG
AGCTACGACGAGCG
>chr2
GGTAGTTAGTAGAGTTAGACTACGATCG
ATCGATCGATCGAGCGGCGCGCATCGAT
CGTAGCCGCGGCGTACGTAGCGCAGCAG
TCGTAGCTACGTAG

Output Non-Integer Fasta

>chr1:20-30
AGCGACTACG
>chr2:30-40
CGATCGATCG

[ gia intersect ]

Background

This subcommand calculates the genomic regions found at the intersection of the first input (query) and the second input (target).

(query)     x-------------y        x----------y   x------y
(target)        i------j        i------j        i-----------j
=================================================================
                i------j           x---j          x------y

Usage

See full arguments and options using:

gia intersect --help

Default Behavior

By default the intervals in the query and target are not assumed sorted or merged and intersections are performed inplace.

gia intersect -a <query.bed> -b <target.bed>

Streamed Intervals

If the intervals in the query and target are presorted then we can operate on them as a stream - which keeps the memory usage constant and leads to speedups at very high number of intervals.

gia intersect -a <query.bed> -b <target.bed> -S

Fractional Overlaps

We can also define conditional operations on fractional overlaps of the query, target, or both.

This means that the intersection will only be done on query-target pairs which meet the fractional overlap predicate provided.

On Query

              x-------------y   x---------------y
              i---------j       i--j
======================================================
(-f 0.5)
(ix)          x---------j

We can supply a minimum overlap requirement on the query with the -f flag.

This will only apply intersection operations on query-target pairs in which the target overlaps the query by the amount required in the -f argument.

In the example case, only the first query-target pair was operated upon since the second did not overlap the query by 50%.

gia intersect -a <query.bed> -b <target.bed> -f 0.5

On Target

        x-------------y   x---------------y
              i---------j               i----------j
======================================================
(-F 0.5)
(ix)          i-------y

We can supply a minimum overlap requirement on the target with the -F flag.

This will only apply intersection operations on query-target pairs in which the query overlaps the target by the amount required in the -F argument.

In the example case, only the first query-target pair was operated upon since the second did not overlap the target by 50%.

gia intersect -a <query.bed> -b <target.bed> -F 0.5

Reciprocal

We can introduce a reciprocal argument (-r) which requires the -f flag and requires that both query and target meet the same requirements of the flag.

gia intersect -a <query.bed> -b <target.bed> -f 0.5 -r

Either

We can introduce the either flag (-e) which is used with both the -f and -F flags. This will require that either condition is met and include those subtraction operations.

gia intersect -a <query.bed> -b <target.bed> -f 0.5 -F 0.3 -e

Reporting Query, Target, or Inverse

In the cases where we're interested in specifically what intervals are overlapping and not necessarily their specific intersections we can report either the query or target interval for each potential intersection event.

# Reports all query intervals that intersect the target intervals
gia intersect -a <query.bed> -b <target.bed> -q

# Reports all query intervals that intersect the target intervals (only once)
gia intersect -a <query.bed> -b <target.bed> -q -u

# Reports all target intervals that intersect the query intervals
gia intersect -a <query.bed> -b <target.bed> -t

We can also report all query intervals that do not intersect the target intervals.

This is an homage to grep -v

# Reports all query intervals that do not intersect the target intervals
gia intersect -a <query.bed> -b <target.bed> -v

[ gia merge ]

Background

This subcommand will merge all overlapping and bordering intervals within the input set.

It can accept either a presorted input (which can be streamed) or an unsorted input which will first be sorted in-place before merging.

Usage

See full arguments and options using:

gia merge --help

Default Behavior (Inplace)

(e)                    q----r
(b)      k----l
(a)    i----j
(c)           l----m
(d)                  o----p
===============================
(1)    i-----------m
(2)                  o------r

This will merge all overlapping and bordering intervals into their sub-spans.

It does not assume presorted input and will sort the input inplace on load.

gia merge -i <input.bed>

Streamable Input

(a)    i----j
(b)      k----l
(c)           l----m
(d)                  o----p
(e)                    q----r
===============================
(1)    i-----------m
(2)                  o------r

This will merge all overlapping and bordering intervals into their sub-spans.

This assumes presorted input and will have undefined behavior is input is not-sorted.

gia merge -i <input.bed> -S

[ gia random ]

Background

This subcommand will generate a random set of intervals.

Usage

See full arguments and options using:

gia random --help

Parameterization

This generates intervals given:

  • Number of Intervals
  • Length of Intervals
  • Number of Chromosomes
  • Max Chromosome Length

The intervals are drawn randomly from a random chromosome and are each equally sized to the length provided.

An optional *.genome file can be given to provide multiple and potentially differently sized chromosome lengths.

# will randomly generate intervals given default params
gia random

# sets the number of chromosomes to 10
gia random -c 10

# generates 100 random intervals
gia random -n 100

# generates 100 random intervals each with a length of 500
gia random -n 100 -l 500

# generates 100 random intervals with a length of 500 using a prior genome
gia random -n 100 -l 500 -g <my.genome>

[ gia sample ]

Background

This subcommand will randomly sample intervals from an input file.

Note: The output of this command is not guaranteed to be sorted.

Usage

See full arguments and options using:

gia sample --help

Sample a fixed number of intervals

You can sample a fixed number of intervals from the dataset using the -n flag

# samples 100 intervals from the dataset
gia sample -i <input.bed> -n 100

Sample a fractional number of intervals

You can sample a fractional number of intervals from the total size of the dataset using the -f flag.

# samples 0.1 (10%) of the intervals
gia sample -i <input.bed> -f 0.1

[ gia sort ]

Background

This subcommand will sort intervals from an input file.

Note: Chromosome Ordering

If named chromosomes are provided they will be sorted by the order in which they appear in the input file. In the future this will likely be lexicographically ordered.

Usage

See full arguments and options using:

gia sort --help

Sort an input BED

You can sort an input file with the following command:

gia sort -i <input.bed>

Sort a named input BED

If the chromosome names are non-integers you can run the following command:

gia sort -i <input.bed> -N

[ gia subtract ]

Background

This subcommand calculates the genomic regions found within the first input (query) while excluding regions in the second (target).

(query)     x-------------y        x----------y   x------y
(target)        i------j        i------j        i-----------j
=================================================================
            x---i      j--y     i--x   j------y i-x      y--j

Usage

See full arguments and options using:

gia subtract --help

Default Behavior (Merging)

By default the intervals in the query and target are merged before subtraction to calculate the difference across the subspans.

(query)            x-------y
                       x-------y
                          x----------y
                   
(target)               i----j
                            i------j
=============================================
(merged_q)        x------------------y
(merged_t)             i-----------j
=============================================
(subtraction)     x----i           j-y
gia subtract -a <query.bed> -b <target.bed>

Unmerged Subtraction

If you would like to keep the intervals separate on merging then you can supply the -u flag to keep the intervals separate.

(query)            x---------y
                       x-------y
                          x----------y
                   
(target)               i---j
                           i-------j
=============================================
(subtraction)     x----i
                           j-y
                                   j-y

This is useful in cases where the intervals are separate classes and it doesn't make sense biologically to merge their interval spans.

gia subtract -a <query.bed> -b <target.bed> -u

Fractional Overlap

We can also define conditional subtraction operations on fractional overlaps of the query, target, or both.

This means that the subtraction will only be done on query-target pairs which meet the fractional overlap predicate provided.

On Query

(query)       x------------------------y    x------y
(target)                 i--j                  i---j
======================================================
(-f 0.5)
(sub)         x------------------------y    x--i

We can supply a minimum overlap requirement on the query with the -f flag.

This will only apply subtraction operations on query-target pairs in which the target overlaps the query by the amount required in the -f argument.

In the example case, only the second query-target pair was operated upon since the first did not overlap the query by 50%.

gia subtract -a <query.bed> -b <target.bed> -f 0.5

On Target

(query)       x------------------------y    x------y
(target)                 i--j                  i--------------j
======================================================
(-F 0.5)
(sub)         x----------i  j----------y    x------y

We can supply a minimum overlap requirement on the target with the -F flag.

This will only apply subtraction operations on query-target pairs in which the query overlaps the target by the amount required in the -F argument.

In the example case, only the first query-target pair was operated upon since the second did not overlap the target by 50%.

gia subtract -a <query.bed> -b <target.bed> -f 0.5

Reciprocal

We can introduce a reciprocal argument (-r) which requires the -f flag and requires that both query and target meet the same requirements of the flag.

gia subtract -a <query.bed> -b <target.bed> -f 0.5 -r

Either

We can introduce the either flag (-e) which is used with both the -f and -F flags. This will require that either condition is met and include those subtraction operations.

gia subtract -a <query.bed> -b <target.bed> -f 0.5 -F 0.3 -e