[ `ggetrs` ]

Introduction

ggetrs is a free, open-source command-line tool that enables efficient querying of genomic databases. It consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying in a single line of code.

This is a rust reimplementation of the original python-based program gget, rewritten to take advantage of rust's powerful HTTP and asynchronous functionality for a faster user experience.

Some function calls differ slightly in syntax from the original gget; a description of each tool is provided on the modules page.

This tool is written fully in rust - but allows for a python interface using pyo3.

If you have questions, please check out the FAQ.

Installation

ggetrs can be installed easily using cargo.

See alternative methods and details for python installation

cargo install ggetrs

Module Overview

Here is a list of currently supported modules

Installation

Installing via Cargo

ggetrs can be installed through the rust package manager cargo. If you have never used rust before, it can be installed with a single command here.

# install ggetrs from crates.io
cargo install ggetrs

Installing via Github

git clone https://github.com/noamteyssier/ggetrs
cd ggetrs
cargo install --path .

Installing the Python Module

If you are also interested in using the python interface for ggetrs you will first need to install maturin and then install ggetrs.

# clone the repo
git clone https://github.com/noamteyssier/ggetrs
cd ggetrs

# install maturin
pip install maturin

# install ggetrs to your current environment
maturin develop

No conda / venv environment

Currently maturin develop requires an active conda or venv environment before it will install a python module. Without one, you can build the wheel first and then install it manually with pip.

# clone the repo
git clone https://github.com/noamteyssier/ggetrs
cd ggetrs

# install maturin
pip install maturin

# build the python wheel
maturin build

# install the python wheel manually
pip install target/wheels/*.whl

Modules

ggetrs currently consists of the following modules:

Functional

These modules provide single-line utility functions, such as running a gene-set enrichment analysis or returning the PDB structure of an input protein.

Module Name	Description
`enrichr`	Perform an enrichment analysis on a list of genes using Enrichr
`archs4`	Find the most correlated genes to a gene of interest or find the gene's tissue expression atlas using ARCHS4
`blast`	BLAST a nucleotide or amino acid sequence to any BLAST database
`search`	Fetch genes and transcripts from Ensembl using free-form search terms.
`info`	Fetch extensive gene and transcript metadata from Ensembl, Uniprot, and NCBI.
`seq`	Fetch nucleotide or amino acid sequences of genes or transcripts from Ensembl or Uniprot respectively.
`ucsc`	Perform a BLAT search using the UCSC Genome Browser.
`pdb`	Get structure and metadata of a protein from the RCSB Protein Data Bank

Database Queries

These modules perform descriptive searches by querying databases directly using their publicly available APIs.

Module Name	Description
`chembl`	Perform a bioactivity search for any protein of interest using Chembl
`ensembl`	Perform Ensembl related queries from their public API.
`uniprot`	Query Uniprot directly for gene/protein information.
`ncbi`	Query NCBI directly for gene/protein information.

Quality of Life

These modules improve the quality of life of anybody using a terminal.

Module Name	Description
`autocomplete`	Generates an autocompletion file for your shell of choice for a better terminal experience.

Enrichr

This module allows for gene set enrichment analysis using Enrichr as well as other methods to explore gene sets provided within the service.

This tool currently has two submodules.

Module	Description
`enrichr`	Performs a gene-set enrichment analysis
`list`	Lists and explores available background libraries

Enrichr

Perform an enrichment analysis on a list of genes using Enrichr.

This requires at minimum a database name (listed here) and any number of gene symbols to perform enrichment analysis on.

Library Shorthands

Some shorthands for the library are built into the program for convenience. These can be used in the command line interface or in the python interface.

Alias	Library
pathway	KEGG_2021_Human
transcription	ChEA_2016
ontology	GO_Biological_Process_2021
diseases_drugs	GWAS_Catalog_2019
celltypes	PanglaoDB_Augmented_2021
kinase_interactions	KEA_2015
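
The alias table can be read as a simple lookup. The following is a minimal sketch in Python, not the actual ggetrs implementation: the `resolve_library` helper is hypothetical, the mapping uses the library names as Enrichr publishes them, and any name that is not a shorthand is passed through unchanged as a full Enrichr library name.

```python
# Hypothetical helper: mirrors the alias table; not taken from the ggetrs source.
LIBRARY_ALIASES = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
    "kinase_interactions": "KEA_2015",
}


def resolve_library(name):
    """Return the full library name for a shorthand, or the name itself."""
    return LIBRARY_ALIASES.get(name, name)


print(resolve_library("ontology"))         # GO_Biological_Process_2021
print(resolve_library("KEGG_2021_Human"))  # KEGG_2021_Human
```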

Arguments

Name	Short	Long	Description
Library	`-l`	`--library`	a library shorthand or any Enrichr library
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# Perform an enrichment analysis using Enrichr
ggetrs enrichr enrichr -l GO_Biological_Process_2021 AP2S1 NSD1 RFX3

# Perform an enrichment analysis with a shorthand library
# this is equivalent to the above search
ggetrs enrichr enrichr -l ontology AP2S1 NSD1 RFX3

# Perform an enrichment analysis on pathway
ggetrs enrichr enrichr -l pathway AP2S1 NSD1 RFX3

Python

import ggetrs

# Search using the ontology shorthand
ggetrs.enrichr("ontology", ["AP2S1", "RFX3", "NSD1"])

# Search using the kinase_interactions shorthand
ggetrs.enrichr("kinase_interactions", ["AP2S1", "RFX3", "NSD1"])

List

Lists the libraries available on Enrichr along with their statistics.

Arguments

Name	Short	Long	Description
Minimal	`-m`	`--minimal`	Return only library names in results
List Categories	`-t`	`--list-categories`	List the categorization of libraries
Categories	`-c`	`--category`	Filter libraries with a specified category ID
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Usage

# List all available libraries and their metadata
ggetrs enrichr list

# List all available libraries and their metadata in a minimal format
ggetrs enrichr list -m

# List the categorization of libraries
ggetrs enrichr list -t

# Filter libraries and their metadata that belong to a category ID
# example category ID 2 = pathways
ggetrs enrichr list -c 2

# Filter libraries and print in a minimal format
ggetrs enrichr list -c 2 -m

ARCHS4

Queries gene-specific information from the ARCHS4 database.

This tool currently has two submodules.

Module	Description
`correlate`	Performs a gene-correlation analysis
`tissue`	Performs a tissue-enrichment analysis

Correlate

Performs a gene-correlation analysis using ARCHS4.

Arguments

Name	Short	Long	Description
Count	`-c`	`--count`	number of values to recover [default: 100]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# Perform a gene-correlation analysis with ARCHS4
ggetrs archs4 correlate AP2S1

# Perform a gene-correlation analysis with ARCHS4
# But only return the top 10 results
ggetrs archs4 correlate -c 10 AP2S1

Python

import ggetrs

# Perform a gene-correlation analysis for AP2S1
# and return the top 10 results
ggetrs.archs4.correlate("AP2S1", 10)

# Perform a gene-correlation analysis for AP2S1
# and return the top 100 results
ggetrs.archs4.correlate("AP2S1", 100)

Tissue

Finds the tissue-level expression of a gene using ARCHS4.

Arguments

Name	Short	Long	Description
Species	`-s`	`--species`	species of organism to recover [default: human] [possible values: human, mouse]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# Find tissue-level expression for AP2S1 in Humans
ggetrs archs4 tissue AP2S1

# Find tissue-level expression for AP2S1 in Mice
ggetrs archs4 tissue -s mouse AP2S1

Python

import ggetrs

# find tissue-level expression for AP2S1 in humans
ggetrs.archs4.tissue("AP2S1", "human")

# find tissue-level expression for AP2S1 in mice
ggetrs.archs4.tissue("AP2S1", "mouse")

BLAST

Help

The BLAST program can be determined from the provided input (will assign either blastn or blastp) and the appropriate database will be used: nt and nr respectively.

You may override these defaults with the argument flags below. Keep in mind that your inputs are not validated: all non-default arguments are passed to the BLAST API as is.
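
How the program is inferred from the input is not spelled out here. One plausible heuristic, shown purely as an illustration and not as the logic ggetrs actually uses, is to check whether the sequence contains only nucleotide letters. The `guess_blast_defaults` function and its nucleotide alphabet are assumptions made for this sketch.

```python
# Illustrative heuristic only; the real detection logic in ggetrs may differ.
NUCLEOTIDES = set("ACGTUN")


def guess_blast_defaults(sequence):
    """Guess a (program, database) pair from the sequence alphabet."""
    letters = set(sequence.upper())
    # A non-empty sequence made only of nucleotide letters is treated as DNA/RNA.
    if letters and letters <= NUCLEOTIDES:
        return ("blastn", "nt")
    # Anything else is treated as an amino acid sequence.
    return ("blastp", "nr")


print(guess_blast_defaults("ATACTCAGTCACACAAGC"))  # ('blastn', 'nt')
print(guess_blast_defaults("MIRFILIQNRAGK"))       # ('blastp', 'nr')
```

A short peptide made up only of letters such as A, C, G and T would be misclassified by this heuristic, which is one reason the explicit `-p` and `-d` flags are useful.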

Arguments

Name	Short	Long	Description
Program	`-p`	`--program`	blast program to use [possible values: blastn, blastp, blastx, tblastn, tblastx]
Database	`-d`	`--database`	blast database to use [possible values: nt, nr, refseq-rna, refseq-protein, swissprot, pdbaa, pdbnt]
Limit	`-l`	`--limit`	Number of hits to return [default: 50]
Expect	`-e`	`--expect`	Minimum expected value to consider [default: 10.0]
Low Complexity Filter	`-f`	`--low-comp-filter`	Include flag to use a complexity filter [default: false]
MEGABLAST	`-m`	`--megablast`	Whether to use MEGABLAST [default: false]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# Perform BLAST with a nucleotide sequence
ggetrs blast ATACTCAGTCACACAAGCCATAGCAGGAAACAGCGAGCTTGCAGCCTCACCGACGAGTCTCAACTAAAAGGGACTCCCGGAGCTAGGGGTGGGGACTCGGCCTCACACAGTGAGTGCCGG

# Perform BLAST with an amino acid sequence
ggetrs blast MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE

# Perform BLAST with an amino acid sequence using the PDBAA database
ggetrs blast -d pdbaa MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE

Python

import ggetrs

# Perform BLAST with a nucleotide sequence
ggetrs.blast(
  "ATACTCAGTCACACAAGCCATAGCAGGAAACAGCGAGCTTGCAGCCTCACCGACGAGTCTCAACTAAAAGGGACTCCCGGAGCTAGGGGTGGGGACTCGGCCTCACACAGTGAGTGCCGG"
)

# Perform BLAST with an amino acid sequence
ggetrs.blast(
  "MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE"
)

# Perform BLAST with an amino acid sequence using the PDBAA database
ggetrs.blast(
  "MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
  database = "pdbaa"
)

# Perform BLAST with an amino acid sequence using the PDBAA database with a low complexity filter and a limit
ggetrs.blast(
  "MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
  database = "pdbaa",
  limit = 10,
  low_comp_filter=True,
)

Search

Searches through descriptions on ENSEMBL using free-form search terms.

Arguments

Name	Short	Long	Description
Database	`-d`	`--database`	Name of Ensembl database to use.
Species	`-s`	`--species`	Species used in database [default: homo_sapiens]
Database Type	`-t`	`--db-type`	Database type specified by Ensembl [default: core]
Release	`-r`	`--release`	release version number to use for database
Assembly	`-a`	`--assembly`	Assembly to use for species [default: 38]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# searches Ensembl for all genes with `clathrin` in the description
ggetrs search clathrin

# searches Ensembl for all genes with `clathrin` OR `heavy` in the description
ggetrs search clathrin heavy

# searches Ensembl for all genes with `clathrin heavy` in the description
ggetrs search "clathrin heavy"
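
The difference between the last two commands comes from how the shell splits arguments before ggetrs ever sees them. This can be checked with Python's standard `shlex` module, which follows POSIX shell quoting rules. The sketch only demonstrates the splitting and makes no claim about how ggetrs processes the terms afterwards.

```python
import shlex

# Unquoted: the shell passes two separate search terms.
unquoted = shlex.split("ggetrs search clathrin heavy")
print(unquoted)  # ['ggetrs', 'search', 'clathrin', 'heavy']

# Quoted: the shell passes a single search term containing a space.
quoted = shlex.split('ggetrs search "clathrin heavy"')
print(quoted)  # ['ggetrs', 'search', 'clathrin heavy']
```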

Python

import ggetrs

# searches Ensembl for all genes with `clathrin` in the description
ggetrs.search(["clathrin"])

# searches Ensembl for all genes with `clathrin` or `heavy` in the description
ggetrs.search(["clathrin", "heavy"])

# searches Ensembl for all genes with `clathrin heavy` in the description
ggetrs.search(["clathrin heavy"])

Info

Fetch extensive gene and transcript metadata from Ensembl, Uniprot, and NCBI.

Arguments

Name	Short	Long	Description
Species	`-s`	`--species`	Species name to use: currently this MUST match the taxon_id [default: homo_sapiens]
Taxon ID	`-t`	`--taxon-id`	Taxon ID to use: currently this MUST match the species [default: 9606]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]
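
Because the species name and the taxon ID have to agree, it can help to keep them paired in one place when scripting around the CLI. The sketch below is a hypothetical helper, not part of ggetrs: it only builds an argument list from the documented `-s` and `-t` flags, and it covers just the two species whose taxon IDs are quoted elsewhere in this documentation (human 9606, mouse 10090).

```python
# Hypothetical helper: keeps species names and taxon IDs paired so the two
# flags of `ggetrs info` cannot drift apart.
SPECIES_TAXON = {
    "homo_sapiens": 9606,
    "mus_musculus": 10090,
}


def info_command(species, symbols):
    """Build the argument list for `ggetrs info` with matching -s and -t."""
    taxon_id = SPECIES_TAXON[species]  # KeyError for species not listed above
    return ["ggetrs", "info", "-s", species, "-t", str(taxon_id), *symbols]


print(info_command("mus_musculus", ["AP2S1"]))
# ['ggetrs', 'info', '-s', 'mus_musculus', '-t', '10090', 'AP2S1']
```

The resulting list can be handed to `subprocess.run` if the `ggetrs` binary is installed.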

Command Line Usage

# Queries information for single term
ggetrs info AP2S1

# Queries information for multiple terms
ggetrs info AP2S1 RFX3 NSD1

Python

import ggetrs

# Queries information for single term
ggetrs.info(["AP2S1"])

# Queries information for multiple terms
ggetrs.info(["AP2S1", "RFX3", "NSD1"])

Seq

Returns nucleotide or amino acid sequence for a provided ensembl ID or gene symbol.

If gene symbols are provided instead of Ensembl IDs for nucleotide sequences, those symbols are first matched to an Ensembl ID using the same functionality as ggetrs ensembl lookup-symbol.

All returned sequences are guaranteed to be in the same order as provided ids/symbols.
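
To make the distinction between IDs and symbols concrete, here is a small sketch that labels each query while preserving the input order. The regular expression is an assumption based on the usual shape of Ensembl gene and transcript stable IDs (such as ENSG00000042753); the rule ggetrs itself applies is not documented here, and the `classify` helper is hypothetical.

```python
import re

# Assumed pattern for Ensembl gene/transcript stable IDs without a version
# suffix: "ENS", optional species letters, "G" or "T", then 11 digits.
ENSEMBL_ID = re.compile(r"ENS[A-Z]*[GT]\d{11}")


def classify(queries):
    """Label each query as 'id' or 'symbol', keeping the input order."""
    return [
        (query, "id" if ENSEMBL_ID.fullmatch(query) else "symbol")
        for query in queries
    ]


print(classify(["ENSG00000042753", "NSD1"]))
# [('ENSG00000042753', 'id'), ('NSD1', 'symbol')]
```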

Arguments

Name	Short	Long	Description
Translate	`-t`	`--translate`	Return the amino acid sequence instead of nucleotide sequence
Species	`-s`	`--species`	Species to specify when not using an Ensembl ID [default: homo_sapiens]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# recover nucleotide sequence for AP2S1 (ENSG00000042753)
ggetrs seq ENSG00000042753

# recover nucleotide sequence for AP2S1
ggetrs seq AP2S1

# recover nucleotide sequence for AP2S1 (ENSG00000042753) and NSD1
ggetrs seq ENSG00000042753 NSD1

# recover amino acid sequence for AP2S1 (ENSG00000042753)
ggetrs seq -t ENSG00000042753

# recover amino acid sequence for AP2S1
ggetrs seq -t AP2S1

# recover amino acid sequences for AP2S1 and NSD1 and RFX3
ggetrs seq -t AP2S1 NSD1 RFX3

Python

import ggetrs

# recover nucleotide sequence for AP2S1 (ENSG00000042753)
ggetrs.seq(["ENSG00000042753"])

# recover nucleotide sequence for AP2S1
ggetrs.seq(["AP2S1"])

# recover nucleotide sequence for AP2S1 (ENSG00000042753) and NSD1
ggetrs.seq(["ENSG00000042753", "NSD1"])

# recover amino acid sequence for AP2S1 (ENSG00000042753)
ggetrs.seq(["ENSG00000042753"], translate=True)

# recover amino acid sequence for AP2S1
ggetrs.seq(["AP2S1"], translate=True)

# recover amino acid sequences for multiple transcripts
ggetrs.seq(["AP2S1", "NSD1", "RFX3"], translate=True)

UCSC

This module is used to interact with the UCSC genome browser. Currently only the BLAT API is implemented.

Module	Description
`blat`	Performs a BLAT sequence search on a provided database

BLAT

Perform a BLAT search using the UCSC Genome Browser.

Arguments

Name	Short	Long	Description
Sequence Type	`-s`	`--seqtype`	Specify the sequence type [default: dna] [possible values: dna, protein, translated-rna, translated-dna]
Database Name	`-d`	`--db-name`	Specifies the database name to query [default: hg38]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# query UCSC genome browser for the first 121 bp of AP2S1
ggetrs ucsc blat GGGCCCTACAACTGCACCCTGAGCCGGAGCTGCCCAGTCGCCGCGGGACCGGGGCCGCTGGGGTCTGGACGGGGGTCGCCATGGTAACGGGGGAGCGCTACGCCGGGGACTGGCGGAGGG

Python

import ggetrs

# query UCSC genome browser for the first 121 bp of AP2S1
ggetrs.ucsc.blat(
  "GGGCCCTACAACTGCACCCTGAGCCGGAGCTGCCCAGTCGCCGCGGGACCGGGGCCGCTGGGGTCTGGACGGGGGTCGCCATGGTAACGGGGGAGCGCTACGCCGGGGACTGGCGGAGGG"
)

# query UCSC genome browser with amino acid sequence
ggetrs.ucsc.blat(
  "MIRFILIQNRAGKTRLAKWYMQFDDDEKQKLIEEVHAVVTVRDAKHTNFVEFRNFKIIYRRYAGLYFCICVDVNDNNLAYLEAIHNFVEVLNEYFHNVCELDLVFNFYKVYTVVDEMFLAGEIRETSQTKVLKQLLMLQSLE",
  seqtype="protein"
)

PDB

Get structure and metadata of a protein from the RCSB Protein Data Bank

There are currently two submodules in PDB:

Module	Description
`structure`	Retrieves PDB structure for a provided RCSB ID
`info`	Retrieves Protein information for a provided RCSB ID

Structure

Retrieves pdb structure for a provided ID

Arguments

Name	Short	Long	Description
Header Only	`-m`	`--header-only`	Retrieve only the PDB Header
Format	`-f`	`--format`	Specify the structure format [default: pdb] [possible values : pdb, cif]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Usage

# return the pdb structure for AP2S1 (6URI)
ggetrs pdb structure 6URI

# return the pdb structure for AP2S1 (6URI) as a `.cif`
ggetrs pdb structure -f cif 6URI

# return the header for AP2S1 (6URI)
ggetrs pdb structure -m 6URI

Info

Retrieves pdb information for a provided ID and resource

Arguments

Name	Short	Long	Description
Resource	`-r`	`--resource`	Specify the resource type [default: entry] [possible values: entry, pubmed, assembly, branched-entity, nonpolymer-entity, polymer-entity, uniprot, branched-entity-instance, polymer-entity-instance, nonpolymer-entity-instance]
Identifier	`-i`	`--identifier`	Specifies the Entry or Chain Identifier
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# return information for AP2S1 (6URI)
ggetrs pdb info 6URI

Chembl

This module is used to query the Chembl database.

Currently the query-APIs available are:

Module	Description
`activity`	Queries for chemical bioactivity for a provided protein target.

Activity

Queries chemical bioactivity for a provided protein target and returns all small molecules.

Arguments

Name	Short	Long	Description
Limit	`-l`	`--limit`	Number of results to return [default: 500]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Usage

# Query the Chembl database for small molecules with bioactivity targeting NSD1
ggetrs chembl activity NSD1

# Query the Chembl database for the top 20 bioactive molecules for NSD1
ggetrs chembl activity -l 20 NSD1

Ensembl

This is a collection of modules to query the ensembl database.

Currently there are the following APIs covered:

Submodule	Description
`search`	Searches through descriptions on ENSEMBL
`database`	Prints all available databases on Ensembl's SQL database
`lookup-id`	Lookup information for genes/transcripts providing ensembl ids
`lookup-symbol`	Lookup information for genes/transcripts providing symbols and species
`release`	Retrieves the latest ensembl release version
`ref`	Retrieves reference files from Ensembl FTP site
`species`	Retrieves the list of species from ENSEMBL FTP site

Database

Prints all available databases on Ensembl's SQL server.

Use this to find the name of a specific database, which can then be passed to ggetrs search.

Arguments

Name	Short	Long	Description
Filter	`-f`	`--filter`	Provides a substring filter to only return databases which contain the substring
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# show all databases in the SQL server
ggetrs ensembl database

# filter for databases with the `sapiens` substring
ggetrs ensembl database -f sapiens

# filter for databases with the `cerevisiae` substring
ggetrs ensembl database -f cerevisiae

Python

import ggetrs

# show all databases in the SQL server
ggetrs.ensembl.database()

# filter for databases with the `sapiens` substring
ggetrs.ensembl.database("sapiens")

# filter for databases with the `cerevisiae` substring
ggetrs.ensembl.database("cerevisiae")

Lookup-Id

Lookup information for genes/transcripts providing ensembl ids

Arguments

Name	Short	Long	Description
Names	`-n`	`--names`	Returns a minimal output of only the found gene names
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Usage

# Query information for AP2S1 (ENSG00000042753)
ggetrs ensembl lookup-id ENSG00000042753

# Query Information for AP2S1 (ENSG00000042753) and NSD1 (ENSG00000165671)
ggetrs ensembl lookup-id ENSG00000042753 ENSG00000165671

# Query information for AP2S1 (ENSG00000042753) and NSD1 (ENSG00000165671)
# but only return their found gene names
# (useful for translating between ensembl IDs and gene symbols)
ggetrs ensembl lookup-id -n ENSG00000042753 ENSG00000165671

Lookup-Symbol

Lookup information for genes/transcripts providing symbols and species

Arguments

Name	Short	Long	Description
Species	`-s`	`--species`	Species to specify [default: homo_sapiens]
IDs	`-i`	`--ids`	Return a minimal output of only the found Ensembl IDs
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Usage

# Query information for AP2S1
ggetrs ensembl lookup-symbol AP2S1

# Query information for AP2S1 and NSD1
ggetrs ensembl lookup-symbol AP2S1 NSD1

# Query information for AP2S1 and NSD1 in mice
ggetrs ensembl lookup-symbol -s mus_musculus AP2S1 NSD1

# Query information for AP2S1 and NSD1 but only return Ensembl IDs
# (useful for translating between Ensembl IDs and gene symbols)
ggetrs ensembl lookup-symbol -i AP2S1 NSD1

Ref

Retrieves reference files from the Ensembl FTP site.

Help

Name	Short	Long	Description
Species	`-s`	`--species`	Species to query data for [default: homo_sapiens]
Release	`-r`	`--release`	Release to use - will default to latest release
Data Type	`-d`	`--datatype`	Datatype to query for - provided as a comma-separated list
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]
Download	`-D`	`--download`	Download all requested files to the current working directory

Command Line Interface

# returns the url for human genome (default)
ggetrs ensembl ref

# returns the url for the human cdna transcriptome
ggetrs ensembl ref -d cdna

# returns the url for the human cdna transcriptome and genome
ggetrs ensembl ref -d cdna,dna

# returns the url for the mouse cdna transcriptome and genome
ggetrs ensembl ref -d cdna,dna -s mus_musculus

# downloads the requested files to the current directory
ggetrs ensembl ref -D -d cdna,dna,gtf -s homo_sapiens
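
When the datatype list is assembled in a script, it can be worth checking it before calling the CLI. The sketch below splits a comma-separated string and rejects unknown entries. It assumes that `ggetrs ensembl ref` accepts the same datatypes that are listed as possible values for `ggetrs ensembl species`; the `parse_datatypes` helper is hypothetical.

```python
# Datatypes listed for `ggetrs ensembl species`; assumed here to apply to
# `ggetrs ensembl ref` as well.
KNOWN_DATATYPES = {"cdna", "cds", "dna", "gff3", "gtf", "ncrna", "pep"}


def parse_datatypes(value):
    """Split a comma-separated datatype string and reject unknown entries."""
    datatypes = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [d for d in datatypes if d not in KNOWN_DATATYPES]
    if unknown:
        raise ValueError(f"unknown datatype(s): {', '.join(unknown)}")
    return datatypes


print(parse_datatypes("cdna,dna,gtf"))  # ['cdna', 'dna', 'gtf']
```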

Python

import ggetrs

# returns the url for human genome (default)
ggetrs.ensembl.reference()

# returns the url for the human cdna transcriptome
ggetrs.ensembl.reference(
  datatype="cdna"
)

# returns the url for the human cdna transcriptome and genome
ggetrs.ensembl.reference(
  datatype=["cdna", "dna"]
)

# returns the url for the mouse cdna transcriptome and genome
ggetrs.ensembl.reference(
  datatype=["cdna", "dna"], 
  species="mus_musculus"
)

Release

Returns the latest Ensembl release version

Command Line Interface

ggetrs ensembl release

Python

import ggetrs
ggetrs.ensembl.release()

Search

This is another way to access ggetrs search.

Species

Returns the available species in the Ensembl FTP site

Arguments

Name	Short	Long	Description
Release	`-r`	`--release`	Ensembl release version to use - will default to latest release
Data Type	`-d`	`--datatype`	Datatype to query species list [default: dna] [possible values: cdna, cds, dna, gff3, gtf, ncrna, pep]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# return all species where there is a genome (i.e. dna)
ggetrs ensembl species

# return all species where there is a transcriptome (i.e. cdna)
ggetrs ensembl species -d cdna

# return all species where there is a transcriptome (i.e. cdna)
# for an older release
ggetrs ensembl species -d cdna -r 60

Python

import ggetrs

# return all species where there is a genome (i.e. dna)
ggetrs.ensembl.species()

# return all species where there is a transcriptome (i.e. cdna)
ggetrs.ensembl.species(datatype="cdna")

# return all species where there is a transcriptome (i.e. cdna)
# for an older release
ggetrs.ensembl.species(datatype="cdna", release=60)

Uniprot

This is a module for querying the Uniprot database directly.

Currently there is a single module, but more modules are expected to be created in the future and so this command was created as a submodule.

This provides nearly all of the same information as ggetrs info but is significantly faster.

Submodule	Description
`query`	Searches through descriptions on Uniprot

Query

Searches through descriptions on Uniprot

Arguments

Name	Short	Long	Description
Taxon	`-t`	`--taxon`	Taxon to filter results (human: 9606, mouse: 10090)
Freeform	`-f`	`--freeform`	Include flag to perform a freeform search through uniprot. Not including will default to searching for gene symbols.
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# Query uniprot for single terms
ggetrs uniprot query AP2S1

# Query uniprot for multiple terms
ggetrs uniprot query AP2S1 RFX3 NSD1

# Query uniprot with freeform search
ggetrs uniprot query -f rifin

NCBI

This module allows for direct access to APIs provided by NCBI. Currently the following submodules are provided:

Submodule	Description
`taxons`	Retrieves taxon information from NCBI from a query string
`query-ids`	Retrieves information for a list of IDs
`query-symbols`	Retrieves information for a list of symbols (must provide taxon)

Taxons

This retrieves possible taxons from an incomplete query string.

Arguments

Name	Short	Long	Description
Limit	`-l`	`--limit`	Number of search results to return [default: 5]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# return all taxons that contain the substring 'sapiens'
ggetrs ncbi taxons sapiens

# return the first 3 taxons that contain the substring 'sapi'
ggetrs ncbi taxons -l 3 sapi

Query IDs

Retrieves information for a list of NCBI IDs

Arguments

Name	Short	Long	Description
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Usage

# query NCBI for AP2S1 (NCBI ID: 1175)
ggetrs ncbi query-ids 1175

# query NCBI for AP2S1 and NSD1 (1175 and 64324 respectively)
ggetrs ncbi query-ids 1175 64324

Query Symbols

Query NCBI for gene symbols with a provided taxon ID. You can determine taxon IDs for your organism of choice with ggetrs ncbi taxons.

Arguments

Name	Short	Long	Description
Taxon ID	`-t`	`--taxon-id`	Taxon ID (human: 9606, mouse: 10090) [default: 9606]
Output	`-o`	`--output`	optional filepath to write output to [default=stdout]

Command Line Interface

# query NCBI for the symbol AP2S1
ggetrs ncbi query-symbols AP2S1

# query NCBI for the symbol AP2S1 in mice
ggetrs ncbi query-symbols -t 10090 AP2S1

Autocomplete

This is used to generate autocomplete information for your terminal shell.

Arguments

Name	Short	Long	Description
Shell	`-s`	`--shell`	Shell to generate autocompletions for [possible values: bash, elvish, fish, powershell, zsh]

Command Line Interface

# generate autocompletions for the fish shell
ggetrs autocomplete -s fish

# write autocomplete directly to fish shell config
ggetrs autocomplete -s fish > ~/.config/fish/completions/ggetrs.fish

FAQ

What makes this different than `gget`?

ggetrs takes advantage of rust's powerful asynchronous features and lets you perform a large number of HTTP requests without increasing wait times.

Since it is also a compiled program, there is no start-up time between commands, and you can run your favorite tool in a for loop with no overhead.
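
As a rough illustration of running a ggetrs command in a loop, the sketch below drives the documented `ggetrs enrichr enrichr` CLI from Python's `subprocess` module. The gene-set names and the `enrichr_command` helper are made up for the example, and the commands are only executed if a `ggetrs` binary is found on the PATH.

```python
import shutil
import subprocess

# Example gene sets; the names and groupings are placeholders.
GENE_SETS = {
    "set_a": ["AP2S1", "NSD1", "RFX3"],
    "set_b": ["AP2S1", "RFX3"],
}


def enrichr_command(library, genes):
    """Build the documented `ggetrs enrichr enrichr` invocation."""
    return ["ggetrs", "enrichr", "enrichr", "-l", library, *genes]


commands = {
    name: enrichr_command("pathway", genes)
    for name, genes in GENE_SETS.items()
}

# Only run the commands when the ggetrs binary is actually installed.
if shutil.which("ggetrs"):
    for name, command in commands.items():
        result = subprocess.run(command, capture_output=True, text=True)
        print(name, result.returncode, len(result.stdout))
```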

However ggetrs stays true to the original gget mindset and tries to make usage as simple as possible no matter the interface (from python to CLI).

Does this have functions that `gget` doesn't?

We're working on having both tools mirror functionality - but currently this includes the Chembl bioactivity database, more endpoints from the Ensembl API, and direct queries to NCBI and Uniprot.

Does `gget` have functions that `ggetrs` does not?

ggetrs will likely not support the gget muscle and gget alphafold functionalities for the time being. The reason is that these are wrappers around existing binaries and not HTTP requests.

Do I need to know rust to use this tool?

This tool is written fully in rust - but allows for a python interface using pyo3. Currently not all tools have a python API - but they are planned to be implemented eventually.

All of the currently supported gget modules have their python API implemented.

Citations

gget

Luebbert L. & Pachter L. (2022). Efficient querying of genomic reference databases with gget. bioRxiv 2022.05.17.492392; doi: https://doi.org/10.1101/2022.05.17.492392

ARCHS4

Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6

Bray NL, Pimentel H, Melsted P and Pachter L, Near optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, p 525--527 (2016). https://doi.org/10.1038/nbt.3519

Enrichr

Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma'ayan A. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013; 128(14).

Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, McDermott MG, Monteiro CD, Gundersen GW, Ma'ayan A. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. 2016; gkw377.

Xie Z, Bailey A, Kuleshov MV, Clarke DJB., Evangelista JE, Jenkins SL, Lachmann A, Wojciechowicz ML, Kropiwnicki E, Jagodnik KM, Jeon M, & Ma’ayan A. Gene set knowledge discovery with Enrichr. Current Protocols, 1, e90. 2021. doi: 10.1002/cpz1.90.

BLAST

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2. PMID: 2231712.

BLAT

Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. doi: 10.1101/gr.229202. PMID: 11932250; PMCID: PMC187518.

Contributing

This project is intended to be open-source and contributions are very welcome!

Please contribute on the github repo

If you are new to rust or open source in general but still want to contribute please don't hesitate to reach out! I would be more than happy to help guide you through building your first module.

All new additions must pass and follow current testing standards.

Contribution Flow Chart

1. Open an issue describing the feature you would like to add or the problem you would like to fix.
2. Fork the repository or create a branch.
3. Make your changes on the branch (commit frequently and describe the changes in your commit messages).
4. Open a pull request to the main ggetrs repo and request a review.

Bug Reporting

There will be bugs and I'll work on fixing them.

If you run into anything that seems off, please don't hesitate to create an issue on the repo.

Please be as detailed as possible when describing the bug in your issue, including a minimal reproducible example.

For example, if your command didn't work, provide the exact command you used.
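
One way to capture the exact command together with its output is a small wrapper like the one below. This is only a suggestion and not a project requirement; the `bug_report` helper is hypothetical, and it is demonstrated with the Python interpreter so that the sketch runs even where ggetrs is not installed.

```python
import shlex
import subprocess
import sys


def bug_report(command):
    """Run a command and format its exact invocation and output as text."""
    result = subprocess.run(command, capture_output=True, text=True)
    return "\n".join([
        f"command: {shlex.join(command)}",
        f"exit code: {result.returncode}",
        f"stdout:\n{result.stdout.strip()}",
        f"stderr:\n{result.stderr.strip()}",
    ])


# Replace the list with the failing ggetrs command,
# e.g. ["ggetrs", "info", "AP2S1"].
report = bug_report([sys.executable, "-c", "print('ok')"])
print(report)
```

The printed text can be pasted into the issue as is.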

Disclaimer

This tool is provided caveat emptor and relies on the public databases that it draws from. If you have any issues with the resulting data - please refer your rage to the relevant providers and don't shoot the messenger :)