ggetrs
Introduction
ggetrs
is a free, open-source command-line tool that enables efficient querying
of genomic databases.
It consists of a collection of separate but interoperable modules, each designed
to facilitate one type of database querying in a single line of code.
This is a Rust reimplementation of the original Python-based program gget
and was rewritten to take advantage of Rust's powerful HTTP and asynchronous functionality for a faster user experience.
There are some minor syntactic changes between function calls from the original gget
and a description for each tool is provided on the modules page.
This tool is written fully in rust - but allows for a python interface using pyo3.
If you have questions, please check out the FAQ.
Installation
ggetrs can be installed easily using cargo.
See alternative methods and details for python installation
cargo install ggetrs
Module Overview
Here is a list of currently supported modules
External Links
Installation
Installing via Cargo
This can be installed easily through the Rust package manager cargo.
If you have never used rust before it is easily installed with a single command here.
# install ggetrs from crates.io
cargo install ggetrs
Installing via Github
git clone https://github.com/noamteyssier/ggetrs
cd ggetrs
cargo install --path .
Installing the Python Module
If you are also interested in using the Python interface for ggetrs, you will first need to install maturin and then install ggetrs.
# clone the repo
git clone https://github.com/noamteyssier/ggetrs
cd ggetrs
# install maturin
pip install maturin
# install ggetrs to your current environment
maturin develop
No conda / venv environment
Currently maturin develop requires a conda or venv environment to be active before installing a Python module, but you can work around this by first building the wheel and then installing it manually with pip.
# clone the repo
git clone https://github.com/noamteyssier/ggetrs
cd ggetrs
# install maturin
pip install maturin
# build the python wheel
maturin build
# install the python wheel manually
pip install target/wheels/*.whl
Modules
ggetrs
currently consists of the following modules:
Functional
These modules perform single line utility functions like performing a gene-set enrichment analysis or returning the PDB structure of an input protein.
Module Name | Description |
---|---|
enrichr | Perform an enrichment analysis on a list of genes using Enrichr |
archs4 | Find the most correlated genes to a gene of interest or find the gene's tissue expression atlas using ARCHS4 |
blast | BLAST a nucleotide or amino acid sequence to any BLAST database |
search | Fetch genes and transcripts from Ensembl using free-form search terms. |
info | Fetch extensive gene and transcript metadata from Ensembl, Uniprot, and NCBI. |
seq | Fetch nucleotide or amino acid sequences of genes or transcripts from Ensembl or Uniprot respectively. |
ucsc | Perform a BLAT search using the UCSC Genome Browser. |
pdb | Get structure and metadata of a protein from the RCSB Protein Data Bank |
Database Queries
These modules perform descriptive searches by querying databases directly using their publicly available APIs.
Module Name | Description |
---|---|
chembl | Perform a bioactivity search for any protein of interest using Chembl |
ensembl | Perform Ensembl related queries from their public API. |
uniprot | Query Uniprot directly for gene/protein information. |
ncbi | Query NCBI directly for gene/protein information. |
Quality of Life
These modules improve the quality of life of anybody using a terminal.
Module Name | Description |
---|---|
autocomplete | Generates an autocompletion file for your shell of choice for a better terminal experience. |
Enrichr
This module allows for gene set enrichment analysis using Enrichr as well as other methods to explore gene sets provided within the service.
This tool currently has two submodules.
Module | Description |
---|---|
enrichr | Performs a gene-set enrichment analysis |
list | Lists and explores available background libraries |
Enrichr
Perform an enrichment analysis on a list of genes using Enrichr.
This requires at minimum a database name (listed here) and any number of gene symbols to perform enrichment analysis on.
Library Shorthands
Some shorthands for the library are built into the program for convenience. These can be used in the command line interface or in the python interface.
Alias | Library |
---|---|
pathway | KEGG_2021_Human |
transcription | ChEA_2016 |
ontology | GO_Biological_Process_2021 |
diseases_drugs | GWAS_Catalog_2019 |
celltypes | PanglaoDB_Augmented_2021 |
kinase_interactions | KEA_2015 |
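As a rough illustration of how these shorthands behave, the sketch below maps an alias to its full library name and passes any other name through unchanged. The function name `resolve_library` and the pass-through behaviour are assumptions made for this example; this is not the ggetrs implementation.

```python
# Hypothetical sketch of shorthand resolution; not the actual ggetrs code.
# Aliases follow the table above; library names use Enrichr's spelling.
LIBRARY_ALIASES = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
    "kinase_interactions": "KEA_2015",
}


def resolve_library(name: str) -> str:
    """Return the full library name for a shorthand, or the input unchanged."""
    return LIBRARY_ALIASES.get(name, name)
```

With this sketch, `resolve_library("ontology")` and `resolve_library("GO_Biological_Process_2021")` both yield the same library name, which is why the two forms can be used interchangeably.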
Arguments
Name | Short | Long | Description |
---|---|---|---|
Library | -l | --library | a library shorthand or any Enrichr library |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# Perform an enrichment analysis using Enrichr
ggetrs enrichr enrichr -l GO_Biological_Process_2021 AP2S1 NSD1 RFX3
# Perform an enrichment analysis with a shorthand library
# this is equivalent to the above search
ggetrs enrichr enrichr -l ontology AP2S1 NSD1 RFX3
# Perform an enrichment analysis on pathway
ggetrs enrichr enrichr -l pathway AP2S1 NSD1 RFX3
Python
import ggetrs
# Search using the ontology shorthand
ggetrs.enrichr("ontology", ["AP2S1", "RFX3", "NSD1"])
# Search using the kinase_interactions shorthand
ggetrs.enrichr("kinase_interactions", ["AP2S1", "RFX3", "NSD1"])
List
Lists the libraries available on Enrichr along with their statistics.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Minimal | -m | --minimal | Return only library names in results |
List Categories | -t | --list-categories | List the categorization of libraries |
Categories | -c | --category | Filter libraries with a specified category ID |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Usage
# List all available libraries and their metadata
ggetrs enrichr list
# List all available libraries and their metadata in a minimal format
ggetrs enrichr list -m
# List the categorization of libraries
ggetrs enrichr list -t
# Filter libraries and their metadata that belong to a category ID
# example category ID 2 = pathways
ggetrs enrichr list -c 2
# Filter libraries and print in a minimal format
ggetrs enrichr list -c 2 -m
ARCHS4
Queries gene-specific information from the ARCHS4 database.
This tool currently has two submodules.
Module | Description |
---|---|
correlate | Performs a gene-correlation analysis |
tissue | Performs a tissue-enrichment analysis |
Correlate
Performs a gene-correlation analysis using ARCHS4.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Count | -c | --count | number of values to recover [default: 100] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# Perform a gene-correlation analysis with ARCHS4
ggetrs archs4 correlate AP2S1
# Perform a gene-correlation analysis with ARCHS4
# But only return the top 10 results
ggetrs archs4 correlate -c 10 AP2S1
Python
import ggetrs
# Perform a gene-correlation analysis for AP2S1
# and return the top 10 results
ggetrs.archs4.correlate("AP2S1", 10)
# Perform a gene-correlation analysis for AP2S1
# and return the top 100 results
ggetrs.archs4.correlate("AP2S1", 100)
Tissue
Performs a tissue-correlation analysis using ARCHS4.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Species | -s | --species | species of organism to recover [default: human] [possible values: human, mouse] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# Find tissue-level expression for AP2S1 in Humans
ggetrs archs4 tissue AP2S1
# Find tissue-level expression for AP2S1 in Mice
ggetrs archs4 tissue -s mouse AP2S1
Python
import ggetrs
# perform a tissue-correlation analysis for AP2S1 in Humans
ggetrs.archs4.tissue("AP2S1", "human")
# perform a tissue-correlation analysis for AP2S1 in Mice
ggetrs.archs4.tissue("AP2S1", "mouse")
BLAST
Help
The BLAST program can be determined from the provided input (will assign either blastn or blastp) and the appropriate database will be used: nt and nr respectively.
You may override these though by using their argument flags. Keep in mind that there is no logic built into validating your inputs. All non-default arguments will be passed to the BLAST API as is.
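To make the default selection concrete, here is a minimal sketch of one way a program and database could be guessed from the input. The heuristic (treating a sequence made only of `A`, `C`, `G`, `T`, `U`, `N` as nucleotide) and the name `infer_blast_defaults` are assumptions for illustration; ggetrs may decide differently.

```python
# Hypothetical heuristic for choosing a BLAST program; ggetrs may decide differently.
NUCLEOTIDES = set("ACGTUN")


def infer_blast_defaults(sequence: str) -> tuple[str, str]:
    """Guess (program, database) from the characters in the sequence."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("sequence must not be empty")
    # Only nucleotide letters present: treat as DNA/RNA and search `nt`.
    if letters <= NUCLEOTIDES:
        return ("blastn", "nt")
    # Anything else is assumed to be an amino acid sequence: search `nr`.
    return ("blastp", "nr")
```

A short peptide composed only of letters such as `A`, `C`, `G`, and `T` would be misclassified by a heuristic like this, which is one reason the explicit program and database flags are useful.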
Arguments
Name | Short | Long | Description |
---|---|---|---|
Program | -p | --program | blast program to use [possible values: blastn, blastp, blastx, tblastn, tblastx] |
Database | -d | --database | blast database to use [possible values: nt, nr, refseq-rna, refseq-protein, swissprot, pdbaa, pdbnt] |
Limit | -l | --limit | Number of hits to return [default: 50] |
Expect | -e | --expect | Minimum expected value to consider [default: 10.0] |
Low Complexity Filter | -f | --low-comp-filter | Include flag to use a complexity filter [default = false] |
MEGABLAST | -m | --megablast | Whether to use MEGABLAST (default = false) |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# Perform BLAST with a nucleotide sequence
ggetrs blast ATACTCAGTCACACAAGCCATAGCAGGAAACAGCGAGCTTGCAGCCTCACCGACGAGTCTCAACTAAAAGGGACTCCCGGAGCTAGGGGTGGGGACTCGGCCTCACACAGTGAGTGCCGG
# Perform BLAST with an amino acid sequence
ggetrs blast MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE
# Perform BLAST with an amino acid sequence using the PDBAA database
ggetrs blast -d pdbaa MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE
Python
import ggetrs
# Perform BLAST with a nucleotide sequence
ggetrs.blast(
"ATACTCAGTCACACAAGCCATAGCAGGAAACAGCGAGCTTGCAGCCTCACCGACGAGTCTCAACTAAAAGGGACTCCCGGAGCTAGGGGTGGGGACTCGGCCTCACACAGTGAGTGCCGG"
)
# Perform BLAST with an amino acid sequence
ggetrs.blast(
"MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE"
)
# Perform BLAST with an amino acid sequence using the PDBAA database
ggetrs.blast(
"MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
database = "pdbaa"
)
# Perform BLAST with an amino acid sequence using the PDBAA database with a low complexity filter and a limit
ggetrs.blast(
"MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
database = "pdbaa",
limit = 10,
low_comp_filter=True,
)
Search
Searches through descriptions on ENSEMBL using free-form search terms.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Database | -d | --database | Name of Ensembl database to use. |
Species | -s | --species | Species used in database [default: homo_sapiens] |
Database Type | -t | --db-type | Database type specified by Ensembl [default: core] |
Release | -r | --release | release version number to use for database |
Assembly | -a | --assembly | Assembly to use for species [default: 38] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# searches Ensembl for all genes with `clathrin` in the description
ggetrs search clathrin
# searches Ensembl for all genes with `clathrin` OR `heavy` in the description
ggetrs search clathrin heavy
# searches Ensembl for all genes with `clathrin heavy` in the description
ggetrs search "clathrin heavy"
Python
import ggetrs
# searches Ensembl for all genes with `clathrin` in the description
ggetrs.search(["clathrin"])
# searches Ensembl for all genes with `clathrin` or `heavy` in the description
ggetrs.search(["clathrin", "heavy"])
# searches Ensembl for all genes with `clathrin heavy` in the description
ggetrs.search(["clathrin heavy"])
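The difference between passing several terms and passing one quoted phrase can be modelled with a small sketch. It only mirrors the matching semantics described in the comments above (any term matches, a quoted phrase must appear as a whole); the function `matches` and the case-insensitive comparison are assumptions, and this is not the query ggetrs sends to Ensembl.

```python
# Illustrative only: models the OR-versus-phrase semantics described above,
# not the query that ggetrs actually runs against Ensembl.
def matches(description: str, terms: list[str]) -> bool:
    """True if any search term appears in the description (case-insensitive)."""
    text = description.lower()
    return any(term.lower() in text for term in terms)
```

Under this model, `["clathrin", "heavy"]` matches a description containing either word, while `["clathrin heavy"]` matches only descriptions containing the full phrase.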
Info
Fetch extensive gene and transcript metadata from Ensembl, Uniprot, and NCBI.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Species | -s | --species | Species name to use: currently this MUST match the taxon_id [default: homo_sapiens] |
Taxon ID | -t | --taxon-id | Taxon ID to use: currently this MUST match the species [default: 9606] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Usage
# Queries information for single term
ggetrs info AP2S1
# Queries information for multiple terms
ggetrs info AP2S1 RFX3 NSD1
Python
import ggetrs
# Queries information for single term
ggetrs.info(["AP2S1"])
# Queries information for multiple terms
ggetrs.info(["AP2S1", "RFX3", "NSD1"])
Seq
Returns nucleotide or amino acid sequence for a provided ensembl ID or gene symbol.
If gene symbols are provided instead of Ensembl IDs for nucleotide sequences, those symbols will first be matched to an Ensembl ID with the same functionality as ggetrs ensembl lookup-symbol.
All returned sequences are guaranteed to be in the same order as provided ids/symbols.
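The ordering guarantee is worth illustrating, because concurrent requests can finish in any order. The sketch below is a Python analogy, not the Rust implementation: `fetch_sequence` is a hypothetical stand-in for an HTTP request, and `asyncio.gather` is used because it returns results in the order the awaitables were passed in.

```python
import asyncio


async def fetch_sequence(identifier: str) -> str:
    """Hypothetical stand-in for an HTTP request; shorter identifiers finish first."""
    await asyncio.sleep(0.01 * len(identifier))
    return f"sequence-for-{identifier}"


async def fetch_all(identifiers: list[str]) -> list[str]:
    # asyncio.gather returns results in the order the awaitables were passed,
    # regardless of which request completes first.
    return await asyncio.gather(*(fetch_sequence(i) for i in identifiers))


def fetch_in_order(identifiers: list[str]) -> list[str]:
    """Run all simulated requests concurrently and return results in input order."""
    return asyncio.run(fetch_all(identifiers))
```

Here the request for `NSD1` completes before the one for `ENSG00000042753`, yet the returned list still follows the input order.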
Arguments
Name | Short | Long | Description |
---|---|---|---|
Translate | -t | --translate | Return the amino acid sequence instead of nucleotide sequence |
Species | -s | --species | Species to specify when not using an Ensembl ID [default: homo_sapiens] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# recover nucleotide sequence for AP2S1 (ENSG00000042753)
ggetrs seq ENSG00000042753
# recover nucleotide sequence for AP2S1
ggetrs seq AP2S1
# recover nucleotide sequence for AP2S1 (ENSG00000042753) and NSD1
ggetrs seq ENSG00000042753 NSD1
# recover amino acid sequence for AP2S1 (ENSG00000042753)
ggetrs seq -t ENSG00000042753
# recover amino acid sequence for AP2S1
ggetrs seq -t AP2S1
# recover amino acid sequences for AP2S1 and NSD1 and RFX3
ggetrs seq -t AP2S1 NSD1 RFX3
Python
import ggetrs
# recover nucleotide sequence for AP2S1 (ENSG00000042753)
ggetrs.seq(["ENSG00000042753"])
# recover nucleotide sequence for AP2S1
ggetrs.seq(["AP2S1"])
# recover nucleotide sequence for AP2S1 (ENSG00000042753) and NSD1
ggetrs.seq(["ENSG00000042753", "NSD1"])
# recover amino acid sequence for AP2S1 (ENSG00000042753)
ggetrs.seq(["ENSG00000042753"], translate=True)
# recover amino acid sequence for AP2S1
ggetrs.seq(["AP2S1"], translate=True)
# recover amino acid sequences for multiple transcripts
ggetrs.seq(["AP2S1", "NSD1", "RFX3"], translate=True)
UCSC
This module is used to interact with the UCSC genome browser. Currently only the BLAT API is implemented.
Module | Description |
---|---|
blat | Performs a BLAT sequence search on a provided database |
BLAT
Perform a BLAT search using the UCSC Genome Browser.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Sequence Type | -s | --seqtype | Specify the sequence type [default: dna] [possible values: dna, protein, translated-rna, translated-dna] |
Database Name | -d | --db-name | Specifies the database name to query [default: hg38] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# query UCSC genome browser for the first 121 bp of AP2S1
ggetrs ucsc blat GGGCCCTACAACTGCACCCTGAGCCGGAGCTGCCCAGTCGCCGCGGGACCGGGGCCGCTGGGGTCTGGACGGGGGTCGCCATGGTAACGGGGGAGCGCTACGCCGGGGACTGGCGGAGGG
Python
import ggetrs
# query UCSC genome browser for the first 121 bp of AP2S1
ggetrs.ucsc.blat(
"GGGCCCTACAACTGCACCCTGAGCCGGAGCTGCCCAGTCGCCGCGGGACCGGGGCCGCTGGGGTCTGGACGGGGGTCGCCATGGTAACGGGGGAGCGCTACGCCGGGGACTGGCGGAGGG"
)
# query UCSC genome browser with amino acid sequence
ggetrs.ucsc.blat(
"MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
seqtype="protein"
)
PDB
Get structure and metadata of a protein from the RCSB Protein Data Bank.
There are currently two submodules in PDB:
Module | Description |
---|---|
structure | Retrieves PDB structure for a provided RCSB ID |
info | Retrieves Protein information for a provided RCSB ID |
Structure
Retrieves pdb structure for a provided ID
Arguments
Name | Short | Long | Description |
---|---|---|---|
Header Only | -m | --header-only | Retrieve only the PDB Header |
Format | -f | --format | Specify the structure format [default: pdb] [possible values : pdb, cif] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Usage
# return the pdb structure for AP2S1 (6URI)
ggetrs pdb structure 6URI
# return the pdb structure for AP2S1 (6URI) as a `.cif`
ggetrs pdb structure -f cif 6URI
# return the header for AP2S1 (6URI)
ggetrs pdb structure -m 6URI
Info
Retrieves pdb information for a provided ID and resource
Arguments
Name | Short | Long | Description |
---|---|---|---|
Resource | -r | --resource | Specify the resource to query [default: entry] [possible values: entry, pubmed, assembly, branched-entity, nonpolymer-entity, polymer-entity, uniprot, branched-entity-instance, polymer-entity-instance, nonpolymer-entity-instance] |
Identifier | -i | --identifier | Specifies the Entry or Chain Identifier |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# return information for AP2S1 (6URI)
ggetrs pdb info 6URI
Chembl
This module is used to query the Chembl database.
Currently the query-APIs available are:
Module | Description |
---|---|
activity | Queries for chemical bioactivity for a provided protein target. |
Activity
Queries chemical bioactivity for a provided protein target and returns all small molecules.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Limit | -l | --limit | Number of results to return [default: 500] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Usage
# Query the Chembl database for small molecules with bioactivity targeting NSD1
ggetrs chembl activity NSD1
# Query the Chembl database for the top 20 bioactive molecules for NSD1
ggetrs chembl activity -l 20 NSD1
Ensembl
This is a collection of modules to query the ensembl database.
Currently there are the following APIs covered:
Submodule | Description |
---|---|
search | Searches through descriptions on ENSEMBL |
database | Prints all available databases on Ensembl's SQL server |
lookup-id | Lookup information for genes/transcripts providing ensembl ids |
lookup-symbol | Lookup information for genes/transcripts providing symbols and species |
release | Retrieves the latest ensembl release version |
ref | Retrieves reference files from Ensembl FTP site |
species | Retrieves the list of species from ENSEMBL FTP site |
Database
Prints all available databases on Ensembl's SQL server.
This is useful if you are interested in querying a specific database; the database name can then be passed into ggetrs search.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Filter | -f | --filter | Provides a substring filter to only return databases which contain the substring |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# show all databases in the SQL server
ggetrs ensembl database
# filter for databases with the `sapiens` substring
ggetrs ensembl database -f sapiens
# filter for databases with the `cerevisiae` substring
ggetrs ensembl database -f cerevisiae
Python
import ggetrs
# show all databases in the SQL server
ggetrs.ensembl.database()
# filter for databases with the `sapiens` substring
ggetrs.ensembl.database("sapiens")
# filter for databases with the `cerevisiae` substring
ggetrs.ensembl.database("cerevisiae")
Lookup-Id
Lookup information for genes/transcripts providing ensembl ids
Arguments
Name | Short | Long | Description |
---|---|---|---|
Names | -n | --names | Returns a minimal output of only the found gene names |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Usage
# Query information for AP2S1 (ENSG00000042753)
ggetrs ensembl lookup-id ENSG00000042753
# Query Information for AP2S1 (ENSG00000042753) and NSD1 (ENSG00000165671)
ggetrs ensembl lookup-id ENSG00000042753 ENSG00000165671
# Query information for AP2S1 (ENSG00000042753) and NSD1 (ENSG00000165671)
# but only return their found gene names
# (useful for translating between ensembl IDs and gene symbols)
ggetrs ensembl lookup-id -n ENSG00000042753 ENSG00000165671
Lookup-Symbol
Lookup information for genes/transcripts providing symbols and species
Arguments
Name | Short | Long | Description |
---|---|---|---|
Species | -s | --species | Species to specify [default: homo_sapiens] |
IDs | -i | --ids | Return a minimal output of only the found Ensembl IDs |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Usage
# Query information for AP2S1
ggetrs ensembl lookup-symbol AP2S1
# Query information for AP2S1 and NSD1
ggetrs ensembl lookup-symbol AP2S1 NSD1
# Query information for AP2S1 and NSD1 in mice
ggetrs ensembl lookup-symbol -s mus_musculus AP2S1 NSD1
# Query information for AP2S1 and NSD1 but only return Ensembl IDs
# (useful for translating between Ensembl IDs and gene symbols)
ggetrs ensembl lookup-symbol -i AP2S1 NSD1
Ref
Retrieves reference files from the Ensembl FTP site.
Help
Name | Short | Long | Description |
---|---|---|---|
Species | -s | --species | Species to query data for [default: homo_sapiens] |
Release | -r | --release | Release to use - will default to latest release |
Data Type | -d | --datatype | Datatype to query for - provided as a comma-separated list |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Download | -D | --download | Download all requested files to the current working directory |
Command Line Interface
# returns the url for human genome (default)
ggetrs ensembl ref
# returns the url for the human cdna transcriptome
ggetrs ensembl ref -d cdna
# returns the url for the human cdna transcriptome and genome
ggetrs ensembl ref -d cdna,dna
# returns the url for the mouse cdna transcriptome and genome
ggetrs ensembl ref -d cdna,dna -s mus_musculus
# downloads the requested files to the current directory
ggetrs ensembl ref -D -d cdna,dna,gtf -s homo_sapiens
Python
import ggetrs
# returns the url for human genome (default)
ggetrs.ensembl.reference()
# returns the url for the human cdna transcriptome
ggetrs.ensembl.reference(
datatype=["cdna"]
)
# returns the url for the human cdna transcriptome and genome
ggetrs.ensembl.reference(
datatype=["cdna", "dna"]
)
# returns the url for the mouse cdna transcriptome and genome
ggetrs.ensembl.reference(
datatype=["cdna", "dna"],
species="mus_musculus"
)
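Because the datatype is given as a comma-separated list on the command line, it can help to validate such a string before building a command. The helper below is a hypothetical sketch: `parse_datatypes` is not part of ggetrs, and the set of accepted values is assumed to match the values listed for `ggetrs ensembl species`.

```python
# Hypothetical helper; the accepted values are assumed to match those listed
# for `ggetrs ensembl species` (cdna, cds, dna, gff3, gtf, ncrna, pep).
VALID_DATATYPES = {"cdna", "cds", "dna", "gff3", "gtf", "ncrna", "pep"}


def parse_datatypes(arg: str) -> list[str]:
    """Split a comma-separated datatype string and reject unknown values."""
    datatypes = [item.strip() for item in arg.split(",") if item.strip()]
    unknown = [d for d in datatypes if d not in VALID_DATATYPES]
    if unknown:
        raise ValueError(f"unknown datatype(s): {', '.join(unknown)}")
    return datatypes
```

The resulting list preserves the order given by the user, so `"cdna,dna"` becomes `["cdna", "dna"]`.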
Release
Returns the latest Ensembl release version
Command Line Interface
ggetrs ensembl release
Python
import ggetrs
ggetrs.ensembl.release()
Search
This is another way to access ggetrs search.
Species
Returns the available species in the Ensembl FTP site
Arguments
Name | Short | Long | Description |
---|---|---|---|
Release | -r | --release | Ensembl release version to use - will default to latest release |
Data Type | -d | --datatype | Datatype to query species list [default: dna] [possible values: cdna, cds, dna, gff3, gtf, ncrna, pep] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# return all species where there is a genome (i.e. dna)
ggetrs ensembl species
# return all species where there is a transcriptome (i.e. cdna)
ggetrs ensembl species -d cdna
# return all species where there is a transcriptome (i.e. cdna)
# for an older release
ggetrs ensembl species -d cdna -r 60
Python
import ggetrs
# return all species where there is a genome (i.e. dna)
ggetrs.ensembl.species()
# return all species where there is a transcriptome (i.e. cdna)
ggetrs.ensembl.species(datatype="cdna")
# return all species where there is a transcriptome (i.e. cdna)
# for an older release
ggetrs.ensembl.species(datatype="cdna", release=60)
Uniprot
This is a module for querying the Uniprot database directly.
Currently there is a single module, but more modules are expected to be created in the future and so this command was created as a submodule.
This provides nearly all of the same information as ggetrs info but is significantly faster.
Submodule | Description |
---|---|
query | Searches through descriptions on Uniprot |
Query
Searches through descriptions on Uniprot
Arguments
Name | Short | Long | Description |
---|---|---|---|
Taxon | -t | --taxon | Taxon to filter results (human: 9606, mouse: 10090) |
Freeform | -f | --freeform | Include flag to perform a freeform search through uniprot. Not including will default to searching for gene symbols. |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# Query uniprot for single terms
ggetrs uniprot query AP2S1
# Query uniprot for multiple terms
ggetrs uniprot query AP2S1 RFX3 NSD1
# Query uniprot with freeform search
ggetrs uniprot query -f rifin
NCBI
This module allows for direct access to APIs provided by NCBI. Currently the following submodules are provided:
Submodule | Description |
---|---|
taxons | Retrieves taxon information from NCBI from a query string |
query-ids | Retrieves information for a list of IDs |
query-symbols | Retrieves information for a list of symbols (must provide taxon) |
Taxons
This retrieves possible taxons from an incomplete query string.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Limit | -l | --limit | Number of search results to return [default: 5] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# return all taxons that contain the substring 'sapiens'
ggetrs ncbi taxons sapiens
# return the first 3 taxons that contain the substring 'sapi'
ggetrs ncbi taxons -l 3 sapi
Query IDs
Retrieves information for a list of NCBI IDs
Arguments
Name | Short | Long | Description |
---|---|---|---|
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Usage
# query NCBI for AP2S1 (NCBI ID: 1175)
ggetrs ncbi query-ids 1175
# query NCBI for AP2S1 and NSD1 (1175 and 64324 respectively)
ggetrs ncbi query-ids 1175 64324
Query Symbols
Query NCBI for gene symbols with a provided taxon ID.
You can determine taxon IDs for your organism of choice with ggetrs ncbi taxons.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Taxon ID | -t | --taxon-id | Taxon ID (human: 9606, mouse: 10090) [default: 9606] |
Output | -o | --output | optional filepath to write output to [default=stdout] |
Command Line Interface
# query NCBI for the symbol AP2S1
ggetrs ncbi query-symbols AP2S1
# query NCBI for the symbol AP2S1 in mice
ggetrs ncbi query-symbols -t 10090 AP2S1
Autocomplete
This is used to generate autocomplete information for your terminal shell.
Arguments
Name | Short | Long | Description |
---|---|---|---|
Shell | -s | --shell | Shell to generate autocompletions for [possible values: bash, elvish, fish, powershell, zsh] |
Command Line Interface
# generate autocompletions for the fish shell
ggetrs autocomplete -s fish
# write autocomplete directly to fish shell config
ggetrs autocomplete -s fish > ~/.config/fish/completions/ggetrs.fish
FAQ
What makes this different from gget?
ggetrs
takes advantage of Rust's powerful asynchronous features
and lets you perform a large number of HTTP requests without increasing wait times.
Since it is a compiled program as well there is no start-up time between commands
and you can run your favorite tool in a for loop
with no overhead.
However ggetrs
stays true to the original gget
mindset and tries to make usage
as simple as possible no matter the interface (from python to CLI).
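As an example of running the tool in a loop, the sketch below shells out to the CLI once per gene from Python. It only uses the `ggetrs seq -t` invocation documented in the Seq module; the helper names `build_command` and `run_for_genes` are hypothetical, and the binary is assumed to be on your `PATH`.

```python
import shutil
import subprocess


def build_command(gene: str) -> list[str]:
    """Build the argv for fetching the amino acid sequence of one gene."""
    return ["ggetrs", "seq", "-t", gene]


def run_for_genes(genes: list[str]) -> dict[str, str]:
    """Call the ggetrs CLI once per gene and collect each command's stdout."""
    if shutil.which("ggetrs") is None:
        raise RuntimeError("ggetrs was not found on PATH")
    results = {}
    for gene in genes:
        completed = subprocess.run(
            build_command(gene), capture_output=True, text=True, check=True
        )
        results[gene] = completed.stdout
    return results
```

Calling `run_for_genes(["AP2S1", "NSD1", "RFX3"])` would issue three separate commands; for a batch like this, passing all symbols to a single `ggetrs seq` call is the simpler option.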
Does this have functions that gget doesn't?
We're working on having both tools mirror functionality - but currently this includes the Chembl bioactivity database, more endpoints from the Ensembl API, and direct queries to NCBI and Uniprot.
Does gget have functions that ggetrs does not?
ggetrs will likely not support the gget muscle and gget alphafold functionalities for the time being.
The reason is that these are wrappers around existing binaries rather than HTTP requests.
Do I need to know rust to use this tool?
This tool is written fully in rust - but allows for a python interface using pyo3. Currently not all tools have a python API - but they are planned to be implemented eventually.
All of the currently supported gget
modules have their python API implemented.
Citations
gget
Luebbert L. & Pachter L. (2022). Efficient querying of genomic reference databases with gget. bioRxiv 2022.05.17.492392; doi: https://doi.org/10.1101/2022.05.17.492392
ARCHS4
Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6
Bray NL, Pimentel H, Melsted P and Pachter L, Near optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, p 525--527 (2016). https://doi.org/10.1038/nbt.3519
Enrichr
Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013; 128(14).
Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, McDermott MG, Monteiro CD, Gundersen GW, Ma'ayan A. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. 2016; gkw377.
Xie Z, Bailey A, Kuleshov MV, Clarke DJB., Evangelista JE, Jenkins SL, Lachmann A, Wojciechowicz ML, Kropiwnicki E, Jagodnik KM, Jeon M, & Ma’ayan A. Gene set knowledge discovery with Enrichr. Current Protocols, 1, e90. 2021. doi: 10.1002/cpz1.90.
BLAST
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2. PMID: 2231712.
BLAT
Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. doi: 10.1101/gr.229202. PMID: 11932250; PMCID: PMC187518.
Contributing
This project is intended to be open-source and contributions are very welcome!
Please contribute on the GitHub repo.
If you are new to rust or open source in general but still want to contribute please don't hesitate to reach out! I would be more than happy to help guide you through building your first module.
All new additions must pass and follow current testing standards.
Contribution Flow Chart
- Open an issue describing what you would like to add as a feature or the problem you'd like to fix
- Fork the repository / create a branch
- Make the changes you'd like to the branch (commit frequently and describe the changes in your commit messages)
- Open a pull request to the main ggetrs repo and request a review
Bug Reporting
There will be bugs and I'll work on fixing them.
But if you run into anything that seems off, please don't hesitate to create an issue on the repo.
Please be as detailed as possible when describing the bug in your issue, including a minimal reproducible example.
i.e. if your command didn't work provide the exact command you used.
Disclaimer
This tool is provided caveat emptor and relies on the public databases that it draws from. If you have any issues with the resulting data - please refer your rage to the relevant providers and don't shoot the messenger :)