[ `crispr_screen` ]

Introduction

crispr_screen is a free, open-source command-line tool that enables easy and efficient differential expression analysis for CRISPR screens.

It is a reimplementation of the MAGeCK algorithm, which adapts the DESeq method to CRISPR screens. It extends MAGeCK with the MAGeCK-INC method and also implements the αRRA algorithm described in the original MAGeCK paper.

It is intended as a drop-in replacement for existing differential expression analyses, with an emphasis on speed, stability, and ease of use.

Installation

Installing Rust

If you don't already have cargo, the Rust package manager, installed:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing `crispr_screen`

cargo install crispr_screen

Usage

Quick Start

To get started immediately check out quick start.

Input and Output Formats

To learn more about the input and output formats check out formats.

Configuration Options

To learn more about configuration options check out customized runs.

Quick Start

Input Files

crispr_screen is meant to be run from the command-line and expects at minimum three arguments:

an sgRNA-Gene count table (see formats for details)
the labels of the controls
the labels of the treatments

For additional options and descriptions see arguments.

Running [ `crispr_screen` ]

crispr_screen test \
  -i count_table.tsv \
  -c low_1 low_2 \
  -t high_1 high_2

Output Files

This will create three results files:

results.sgrna_results.tsv
results.gene_results.tsv
results.hits.tsv

The first two files contain differential expression statistics at the sgRNA level and the gene level respectively. The third contains only the genes found to be significant (see formats for descriptions of each file).

Expected Formats

Inputs

Count Table

In the above example, the count_table.tsv is a tab-separated file of the following form:

sgrna	gene	low_1	high_1	low_2	high_2
sgrna.0	gene.0	1	4	2	6
sgrna.1	gene.0	2	6	3	4
...	...	...	...	...	...
sgrna.n	gene.m	4	12	5	20
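To make the layout above concrete, here is a minimal Python sketch that parses a table of this shape. It is only an illustration of the format, not how crispr_screen itself reads its input; the function name `read_count_table` and the embedded example data are hypothetical, and integer counts are assumed.

```python
import csv
import io

# Hypothetical example data mirroring the layout shown above.
COUNT_TABLE = (
    "sgrna\tgene\tlow_1\thigh_1\tlow_2\thigh_2\n"
    "sgrna.0\tgene.0\t1\t4\t2\t6\n"
    "sgrna.1\tgene.0\t2\t6\t3\t4\n"
)


def read_count_table(handle):
    """Return (sample_names, rows); each row is (sgrna, gene, {sample: count})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 4:
        raise ValueError("expected sgRNA, gene, and at least two sample columns")
    samples = header[2:]  # the first two columns are the sgRNA and gene names
    rows = []
    for record in reader:
        if len(record) != len(header):
            raise ValueError(f"row has {len(record)} fields, expected {len(header)}")
        counts = {name: int(value) for name, value in zip(samples, record[2:])}
        rows.append((record[0], record[1], counts))
    return samples, rows


samples, rows = read_count_table(io.StringIO(COUNT_TABLE))
print(samples)
print(rows[0])
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.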

Note:

The sample names provided do not need to be in the same order as they appear in the file.

However, the first two columns do need to be an sgRNA column and a gene column respectively (though they can be named whatever you like).

If you have extra columns in the table you don't want to analyze, just provide the names of the columns you do want to analyze.

The above command will still work as expected for the following table:

sgrna	gene	low_1	high_1	low_2	high_2	ext_1
sgrna.0	gene.0	1	4	2	6	20
sgrna.1	gene.0	2	6	3	4	30
...	...	...	...	...	...	...
sgrna.n	gene.m	4	12	5	20	5
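The name-based selection described above can be sketched in Python as follows. This is an illustration of the idea only, under the assumption that samples are looked up by header name; the helper `sample_indices` is hypothetical and not part of crispr_screen.

```python
def sample_indices(header, labels):
    """Map sample labels to column positions, independent of column order."""
    positions = {name: index for index, name in enumerate(header)}
    missing = [label for label in labels if label not in positions]
    if missing:
        raise KeyError(f"labels not found in header: {missing}")
    return [positions[label] for label in labels]


# Header and first row of the table with the extra column `ext_1`.
header = ["sgrna", "gene", "low_1", "high_1", "low_2", "high_2", "ext_1"]
row = ["sgrna.0", "gene.0", "1", "4", "2", "6", "20"]

controls = sample_indices(header, ["low_1", "low_2"])
treatments = sample_indices(header, ["high_1", "high_2"])

# `ext_1` is never requested, so it is simply ignored.
control_counts = [int(row[i]) for i in controls]
treatment_counts = [int(row[i]) for i in treatments]
print(control_counts, treatment_counts)
```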

Outputs

sgRNA Results

The sgRNA results dataframe (written to <args.output>.sgrna_results.tsv) is a table whose columns are of the following form:

Column	Description
sgrna	The sgRNA name provided in the first column of the `count_table`.
gene	The gene name provided in the second column of the `count_table`.
base	The normalized/aggregated count of the sgRNA across all samples.
control	The normalized/aggregated count of the sgRNA for the controls.
treatment	The normalized/aggregated count of the sgRNA for the treatment.
adj_var	The adjusted variance for the sgRNA determined by the least squares fit.
fold_change	The fold change of the treatment over the controls.
log2_fold_change	The log2 fold change of the treatment over the controls.
pvalue_low	The p-value for a depletion of the sgRNA.
pvalue_high	The p-value for an enrichment of the sgRNA.
pvalue_twosided	The two-sided p-value of an enrichment or depletion of the sgRNA.
fdr	The adjusted false discovery rate of the sgRNA.
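As a small worked example of how the `fold_change` and `log2_fold_change` columns relate to the `control` and `treatment` columns, here is a Python sketch. It assumes the fold change is the plain ratio of treatment over control; how crispr_screen handles zero counts or pseudocounts is not described here, so this sketch simply rejects non-positive values. The function name `fold_changes` is hypothetical.

```python
import math


def fold_changes(control, treatment):
    """Return (fold_change, log2_fold_change) of treatment over control."""
    # Assumption: strictly positive normalized counts; zero handling is not modeled.
    if control <= 0 or treatment <= 0:
        raise ValueError("this sketch requires strictly positive normalized counts")
    fold_change = treatment / control
    return fold_change, math.log2(fold_change)


fc, log2_fc = fold_changes(control=2.0, treatment=8.0)
print(fc, log2_fc)
```

A positive log2 fold change indicates enrichment in the treatment and a negative one indicates depletion.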

Gene Results

The gene results dataframe (written to <args.output>.gene_results.tsv) is a table whose columns are of the following form:

Column	Description
gene	The gene name provided in the second column of the `count_table`.
fold_change	The aggregated fold change of the treatment from the controls.
log_fold_change	The log2 aggregated fold change of the treatment from the controls.
score_low	The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being depleted.
pvalue_low	The aggregated p-value for a depletion of the gene.
fdr_low	The false discovery rate for a depletion of the gene.
score_high	The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being enriched.
pvalue_high	The aggregated p-value for an enrichment of the gene.
fdr_high	The false discovery rate for an enrichment of the gene.
pvalue	The minimum p-value observed with either test.
fdr	The minimum false discovery rate observed with either test.
phenotype_score	The -log10 FDR multiplied by the log2 fold change of the gene.

Note: If you ran crispr_screen with INC

The FDR in this case will be the empirical false discovery rate given the non-targeting controls. As a result, it will not be strictly monotonic with the p-values. I recommend working with the p-values in this case and paying attention to the calculated FDR thresholds given in the workflow log.

Hit Results

The hits dataframe (written to <args.output>.hits.tsv) is a table whose columns are of the following form:

Column	Description
gene	The gene name provided in the second column of the `count_table`.
log2fc	The log2 aggregated fold change of the treatment from the controls.
pvalue	The minimum p-value observed in the aggregation test (minimum of both sides).
phenotype_score	The product of the `log2fc` and the `-log10(pvalue)`.
fdr	The calculated false discovery rate (only shown if running αRRA).
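The `phenotype_score` definition in the table above can be written out as a short Python sketch. It follows the stated formula (the product of `log2fc` and `-log10(pvalue)`); the function name and the example values are hypothetical, and p-values of exactly zero are rejected here because their logarithm is undefined.

```python
import math


def phenotype_score(log2fc, pvalue):
    """Product of the log2 fold change and -log10 of the p-value."""
    if not 0.0 < pvalue <= 1.0:
        raise ValueError("pvalue must be in the interval (0, 1]")
    return log2fc * -math.log10(pvalue)


# A strongly depleted gene with a small p-value gets a large negative score.
score = phenotype_score(log2fc=-2.0, pvalue=0.001)
print(score)
```

The sign of the score follows the direction of the effect, while its magnitude grows with both effect size and significance.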

Command Arguments

Subcommands

crispr_screen has two subcommands:

test
agg

test performs the sgRNA-level differential abundance tests and, by default, aggregates the results to the gene level. The aggregation step can be skipped with the --skip-agg flag.

agg performs only the gene-level aggregation on a precalculated differential abundance matrix. Take a look at the results.sgrna_results.tsv file to see the expected file format. Column names can be provided as well; details can be found by running crispr_screen agg --help.

Arguments

Required

The three required arguments are as follows:

Argument	Description
input	Filepath of the input count matrix
controls	Labels for the control samples (can take multiple space-separated values)
treatments	Labels for the treatment samples (can take multiple space-separated values)
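If you drive crispr_screen from a script, the three required arguments can be assembled programmatically. The Python sketch below only builds and prints the command line; it does not run the tool. The helper `build_test_command` is hypothetical, the short flags `-i`, `-c`, and `-t` are taken from the quick start example, and placing `--skip-agg` at the end of the command is an assumption.

```python
import shlex


def build_test_command(input_path, controls, treatments, skip_agg=False):
    """Assemble an argument list for `crispr_screen test` (illustrative helper)."""
    if not controls or not treatments:
        raise ValueError("at least one control and one treatment label are required")
    command = ["crispr_screen", "test", "-i", input_path]
    command += ["-c", *controls]    # multiple space-separated control labels
    command += ["-t", *treatments]  # multiple space-separated treatment labels
    if skip_agg:
        command.append("--skip-agg")  # assumed to be accepted in this position
    return command


command = build_test_command("count_table.tsv", ["low_1", "low_2"], ["high_1", "high_2"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once the tool is installed.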

Optional Arguments

These arguments are optional and may change the configuration of the analysis.

For further details on them please run:

crispr_screen test --help

Argument	Description
output	Prefix of the output sgRNA and gene result dataframes
norm	Normalization method to use
agg	Gene aggregation method to use
correction	Multiple hypothesis correction to use
model-choice	Which least squares model to fit
alpha	The alpha threshold parameter for the αRRA algorithm
permutations	The number of permutations to perform in the αRRA algorithm
no-adjust-alpha	Use this flag to keep alpha fixed; otherwise an empirical alpha is calculated from the provided value.
ntc-token	The token string to search for non-targeting controls (if INC)

Methods

Overview

The [ crispr_screen ] pipeline is really two analyses in one.

There is first the sgRNA enrichment analysis, and second the sgRNA aggregation analysis.

The reasoning for this is that we want to determine, from their abundances, which sgRNAs are statistically enriched or depleted in the treatments relative to the controls. However, because the number of sgRNAs is generally very high, there will likely be many sgRNAs that cross the significance threshold spuriously, for either statistical or biological reasons. If we were to call a gene significantly enriched or depleted whenever any one of its sgRNAs was statistically significant, we would likely not end up with a set of genes that is robust in our assay.
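The scale of the problem can be illustrated with some simple arithmetic. The Python sketch below assumes independent tests in which no sgRNA has a true effect; the library size, number of guides per gene, and threshold are hypothetical values chosen only for illustration.

```python
def expected_false_positives(n_sgrnas, alpha):
    """Expected number of sgRNAs passing the threshold by chance alone."""
    return n_sgrnas * alpha


def prob_any_spurious(guides_per_gene, alpha):
    """Chance that at least one guide of a null gene passes the threshold."""
    return 1.0 - (1.0 - alpha) ** guides_per_gene


# Hypothetical screen: 20,000 sgRNAs, 4 guides per gene, threshold of 0.05.
n_false = expected_false_positives(20_000, 0.05)
p_gene = prob_any_spurious(4, 0.05)
print(n_false, round(p_gene, 4))
```

Under these assumptions, roughly one in five genes with no real effect would be called a hit if a single significant sgRNA were enough, which is why the sgRNA-level results are aggregated to the gene level.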

sgRNA Enrichment

An overview of the sgRNA enrichment analysis is as follows:

sgRNA Aggregation

An overview of the sgRNA aggregation procedure is as follows:

Perform sgRNA aggregation (either αRRA or INC)
- αRRA
- INC
Adjust p-values for multiple hypothesis correction
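As an illustration of the final adjustment step, here is a minimal Python sketch of the Benjamini-Hochberg procedure, a common choice for multiple hypothesis correction. Whether this matches the method crispr_screen applies for a given `correction` setting is an assumption; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Each adjusted value is at least as large as its raw p-value, and the ordering of the genes by significance is preserved.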

Normalization

// TODO

Dispersion

// TODO

Enrichment

// TODO

αRRA

// TODO

INC

// TODO

Correction

// TODO
