[ crispr_screen ]

MIT licensed

Introduction

crispr_screen is a free, open-source command-line tool that enables easy and efficient differential expression analysis for CRISPR screens.

It reimplements the MAGeCK algorithm, which is a form of the DESeq method adapted for CRISPR screens. The tool extends MAGeCK with the MAGeCK-INC method and also implements the αRRA algorithm described in the original MAGeCK paper.

It is a drop-in replacement for current differential expression analyses that is faster, more stable, and easier to use.

Installation

Installing Rust

If you don't already have cargo, the Rust package manager, installed:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing crispr_screen

cargo install crispr_screen

Usage

Quick Start

To get started immediately check out quick start.

Input and Output Formats

To learn more about the input and output formats check out formats.

Configuration Options

To learn more about configuration options check out customized runs.

Quick Start

Input Files

crispr_screen is meant to be run from the command-line and expects at minimum three arguments:

  1. an sgRNA-Gene count table (see formats for details)
  2. the labels of the controls
  3. the labels of the treatments

For additional options and descriptions see arguments.

Running [ crispr_screen ]

crispr_screen test \
  -i count_table.tsv \
  -c low_1 low_2 \
  -t high_1 high_2

Output Files

This will create three results files:

  1. results.sgrna_results.tsv
  2. results.gene_results.tsv
  3. results.hits.tsv

The first two contain differential expression statistics at the sgRNA level and the gene level, respectively. The third is a table of only the genes found to be significant (for descriptions of the files see formats).

Expected Formats

Inputs

Count Table

In the above example, the count_table.tsv is a tab-separated file of the following form:

sgrna    gene    low_1  high_1  low_2  high_2
sgrna.0  gene.0  1      4       2      6
sgrna.1  gene.0  2      6       3      4
...      ...     ...    ...     ...    ...
sgrna.n  gene.m  4      12      5      20

Note:

The sample names you provide do not need to be in the same order as they appear in the file.

However, the first two columns do need to be the sgRNA column and the gene column, respectively (though they can be named whatever you like).

If you have extra columns in the table you don't want to analyze, just provide the names of the columns you do want to analyze.

The above command will still work as expected for the following table:

sgrna    gene    low_1  high_1  low_2  high_2  ext_1
sgrna.0  gene.0  1      4       2      6       20
sgrna.1  gene.0  2      6       3      4       30
...      ...     ...    ...     ...    ...     ...
sgrna.n  gene.m  4      12      5      20      5
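
The column-selection behaviour described in this note can be illustrated with a short Python sketch. This is not the tool's own code: `select_counts` is a hypothetical helper, and the embedded table is just the first two rows of the example above. It only shows that samples are picked by name, so neither column order nor extra columns matter.

```python
import csv
import io

# First two rows of the example table, including the extra column `ext_1`.
COUNT_TABLE = (
    "sgrna\tgene\tlow_1\thigh_1\tlow_2\thigh_2\text_1\n"
    "sgrna.0\tgene.0\t1\t4\t2\t6\t20\n"
    "sgrna.1\tgene.0\t2\t6\t3\t4\t30\n"
)


def select_counts(table_text, labels):
    """Return (sgrna, gene, counts) rows, with counts ordered as in `labels`.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # The first two columns are always sgRNA and gene, whatever their names.
    sample_index = {name: i for i, name in enumerate(header) if i >= 2}
    missing = [label for label in labels if label not in sample_index]
    if missing:
        raise KeyError(f"sample labels not found in header: {missing}")
    rows = []
    for record in reader:
        counts = [int(record[sample_index[label]]) for label in labels]
        rows.append((record[0], record[1], counts))
    return rows


controls = select_counts(COUNT_TABLE, ["low_1", "low_2"])
treatments = select_counts(COUNT_TABLE, ["high_2", "high_1"])  # any order works
```

The `ext_1` column is never requested, so it is simply ignored.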

Outputs

sgRNA Results

The sgRNA results dataframe (written to <args.output>.sgrna_results.tsv) is a table whose columns are of the following form:

Column            Description
sgrna             The sgRNA name provided in the first column of the count_table.
gene              The gene name provided in the second column of the count_table.
base              The normalized/aggregated count of the sgRNA across all samples.
control           The normalized/aggregated count of the sgRNA for the controls.
treatment         The normalized/aggregated count of the sgRNA for the treatments.
adj_var           The adjusted variance for the sgRNA determined by the least squares fit.
fold_change       The fold change of the treatment over the controls.
log2_fold_change  The log2 fold change of the treatment over the controls.
pvalue_low        The p-value for a depletion of the sgRNA.
pvalue_high       The p-value for an enrichment of the sgRNA.
pvalue_twosided   The two-sided p-value for an enrichment or depletion of the sgRNA.
fdr               The adjusted false discovery rate of the sgRNA.
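
As a rough illustration of how the `fold_change` and `log2_fold_change` columns relate to `control` and `treatment`, here is a minimal Python sketch. It assumes the fold change is the plain ratio of the two normalized values; the tool may treat zero counts or pseudocounts differently, and `fold_change` below is a hypothetical helper rather than part of crispr_screen.

```python
import math


def fold_change(control, treatment):
    """Ratio of treatment over control and its log2 (assumed definition)."""
    if control <= 0 or treatment <= 0:
        raise ValueError("this sketch only handles strictly positive values")
    fc = treatment / control
    return fc, math.log2(fc)


# Hypothetical normalized counts for a single sgRNA.
fc, lfc = fold_change(control=2.0, treatment=8.0)
```

A positive `log2_fold_change` indicates enrichment in the treatments, and a negative one indicates depletion.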

Gene Results

The gene results dataframe (written to <args.output>.gene_results.tsv) is a table whose columns are of the following form:

Column           Description
gene             The gene name provided in the second column of the count_table.
fold_change      The aggregated fold change of the treatment from the controls.
log_fold_change  The log2 aggregated fold change of the treatment from the controls.
score_low        The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being depleted.
pvalue_low       The aggregated p-value for a depletion of the gene.
fdr_low          The false discovery rate for a depletion of the gene.
score_high       The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being enriched.
pvalue_high      The aggregated p-value for an enrichment of the gene.
fdr_high         The false discovery rate for an enrichment of the gene.
pvalue           The minimum p-value observed with either test.
fdr              The minimum false discovery rate observed with either test.
phenotype_score  The -log10 FDR multiplied by the log2 fold change of the gene.
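
The `phenotype_score` description can be written out as a short Python sketch. The function name and the example values are hypothetical; only the formula (-log10 of the FDR times the log2 fold change) comes from the table above.

```python
import math


def phenotype_score(fdr, log2_fold_change):
    """-log10(FDR) multiplied by the log2 fold change."""
    if not 0.0 < fdr <= 1.0:
        raise ValueError("fdr must be in (0, 1] for the logarithm to be defined")
    return -math.log10(fdr) * log2_fold_change


# Hypothetical genes: one enriched, one depleted, both at the same FDR.
enriched = phenotype_score(fdr=0.01, log2_fold_change=2.0)
depleted = phenotype_score(fdr=0.01, log2_fold_change=-1.5)
```

The sign of the score follows the direction of the effect, while its magnitude grows with both the effect size and the significance.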

Note: If you ran crispr_screen with INC

The FDR in this case is the empirical false discovery rate given the non-targeting controls. As a result, it will not be strictly monotonic with the p-values. I recommend working with the p-values in this case and paying attention to the FDR thresholds reported in the workflow log.
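
To see why an empirical FDR need not be monotonic with the p-values, consider the following Python sketch. It uses one common definition as an assumption (the fraction of non-targeting controls among all entries at or below a given p-value); the exact computation in crispr_screen may differ, and the p-values and labels are made up for illustration.

```python
def empirical_fdr(pvalues, is_ntc):
    """Fraction of non-targeting controls among all entries whose p-value is
    at or below each entry's own p-value (assumed definition).

    Assumes distinct p-values; ties are not treated specially.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    fdr = [0.0] * len(pvalues)
    ntc_seen = 0
    for rank, i in enumerate(order, start=1):
        ntc_seen += int(is_ntc[i])
        fdr[i] = ntc_seen / rank
    return fdr


# Hypothetical p-values, already sorted, with two non-targeting controls.
pvalues = [0.001, 0.002, 0.01, 0.02, 0.03]
is_ntc = [False, True, False, False, True]
fdr = empirical_fdr(pvalues, is_ntc)
```

Here the FDR rises when a non-targeting control is encountered and falls again as more targeting entries pass the threshold, even though the p-values only increase.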

Hit Results

The hits dataframe (written to <args.output>.hits.tsv) is a table whose columns are of the following form:

Column           Description
gene             The gene name provided in the second column of the count_table.
log2fc           The log2 aggregated fold change of the treatment from the controls.
pvalue           The minimum p-value observed in the aggregation test (minimum of both sides).
phenotype_score  The product of the log2fc and the -log10(pvalue).
fdr              The calculated false discovery rate (only shown if running αRRA).

Command Arguments

Subcommands

crispr_screen has two subcommands:

  1. test
  2. agg

test is used to perform the sgRNA-level differential abundance tests and then aggregate the results to the gene level. By default test also performs the gene-level aggregation, but this step can be skipped with the --skip-agg flag.

agg is used to perform only the gene-level aggregation on a precalculated differential abundance matrix. Take a look at the results.sgrna_results.tsv file to see the expected file format. Column names can be provided as well; details can be found by running crispr_screen agg --help.

Arguments

Required

The three required arguments are as follows:

Argument    Description
input       Filepath of the input count matrix
controls    Labels for the control samples (can take multiple space-separated values)
treatments  Labels for the treatment samples (can take multiple space-separated values)

Optional Arguments

These arguments are optional and may change the configuration of the analysis.

For further details on them please run:

crispr_screen test --help

Argument         Description
output           Prefix of the output sgRNA and gene result dataframes
norm             Normalization method to use
agg              Gene aggregation method to use
correction       Multiple hypothesis correction to use
model-choice     Which least squares model to fit
alpha            The alpha threshold parameter for the αRRA algorithm
permutations     The number of permutations to perform in the αRRA algorithm
no-adjust-alpha  Use this flag to keep alpha fixed; otherwise an empirical alpha is calculated from the provided one
ntc-token        The token string to search for non-targeting controls (if INC)

Methods

Overview

The [ crispr_screen ] pipeline is really two analyses in one.

The first is the sgRNA enrichment analysis, and the second is the sgRNA aggregation analysis.

The reasoning is that we want to determine, from their abundances, which sgRNAs are statistically enriched or depleted in the treatments relative to the controls. However, because there are generally so many sgRNAs, many of them will likely cross the significance threshold spuriously, for either statistical or biological reasons. If we simply called every gene with a statistically significant sgRNA enriched or depleted, we would likely not end up with a set of genes that is robust in our assay. The aggregation step therefore combines the evidence from all sgRNAs targeting a gene before that gene is called a hit.
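
To make the aggregation idea concrete, here is a Python sketch of a robust rank aggregation score in the spirit of the αRRA method from the MAGeCK paper. It is an illustration under stated assumptions, not the implementation used by crispr_screen. It assumes each sgRNA has a normalized rank in (0, 1], that only ranks at or below `alpha` are considered, and that a gene with no sgRNA below `alpha` gets a score of 1.0. The function names are hypothetical, and the permutation step that turns the score into a p-value is omitted.

```python
from math import comb


def beta_order_cdf(u, k, n):
    """P(the k-th smallest of n uniform(0, 1) draws is <= u)."""
    return sum(comb(n, j) * u**j * (1.0 - u) ** (n - j) for j in range(k, n + 1))


def rra_score(normalized_ranks, alpha=1.0):
    """Minimum order-statistic p-value over the sgRNAs of one gene.

    Only ranks at or below `alpha` are considered; alpha=1.0 uses all of them.
    """
    ranks = sorted(normalized_ranks)
    n = len(ranks)
    candidates = [
        beta_order_cdf(u, k, n)
        for k, u in enumerate(ranks, start=1)
        if u <= alpha
    ]
    # Assumption: no sgRNA below the threshold means the worst possible score.
    return min(candidates, default=1.0)


# Hypothetical genes with four sgRNAs each.
consistent = rra_score([0.01, 0.02, 0.03, 0.04])  # every sgRNA ranks near the top
single_hit = rra_score([0.01, 0.50, 0.70, 0.90])  # only one sgRNA ranks highly
```

A gene whose sgRNAs all rank near the top receives a much smaller score than a gene carried by a single strong sgRNA, which is the robustness the aggregation step is after.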

sgRNA Enrichment

An overview of the sgRNA enrichment analysis is as follows:

sgRNA Aggregation

An overview of the sgRNA aggregation procedure is as follows:

Normalization

// TODO

Dispersion

// TODO

Enrichment

// TODO

aRRA

// TODO

INC

// TODO

Correction

// TODO