crispr_screen
Introduction
crispr_screen is a free, open-source command-line tool that enables easy and efficient differential expression analysis for CRISPR screens.
It is a recreation of the MAGeCK algorithm, which is a form of the DESeq method specifically applied for CRISPR screens. This tool extends the MAGeCK algorithm with the MAGeCK-INC method, but also implements the αRRA algorithm described in the original MAGeCK paper.
It is a drop-in replacement for current differential expression analyses that is faster, more stable, and easier to use.
Installation
Installing Rust
If you don't already have the Rust package manager cargo installed:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Installing crispr_screen
cargo install crispr_screen
Usage
Quick Start
To get started immediately check out quick start.
Input and Output Formats
To learn more about the input and output formats check out formats.
Configuration Options
To learn more about configuration options check out customized runs.
Quick Start
Input Files
crispr_screen is meant to be run from the command line and expects at minimum three arguments:
- an sgRNA-Gene count table (see formats for details)
- the labels of the controls
- the labels of the treatments
For additional options and descriptions see arguments.
Running crispr_screen
crispr_screen test \
-i count_table.tsv \
-c low_1 low_2 \
-t high_1 high_2
Output Files
This will create three results files:
results.sgrna_results.tsv
results.gene_results.tsv
results.hits.tsv
These contain the differential expression statistics for each sgRNA and each gene respectively, plus a table of just the genes found to be significant (for descriptions of the files see formats).
Expected Formats
Inputs
Count Table
In the above example, count_table.tsv is a tab-separated file of the following form:
sgrna | gene | low_1 | high_1 | low_2 | high_2 |
---|---|---|---|---|---|
sgrna.0 | gene.0 | 1 | 4 | 2 | 6 |
sgrna.1 | gene.0 | 2 | 6 | 3 | 4 |
... | ... | ... | ... | ... | ... |
sgrna.n | gene.m | 4 | 12 | 5 | 20 |
Note:
The sample names provided do not need to be in the same order as they appear in the file.
However, the first two columns do need to be an sgRNA column and a gene column respectively (though they can be named whatever you like).
If you have extra columns in the table you don't want to analyze, just provide the names of the columns you do want to analyze.
The above command will still work as expected for the following table:
sgrna | gene | low_1 | high_1 | low_2 | high_2 | ext_1 |
---|---|---|---|---|---|---|
sgrna.0 | gene.0 | 1 | 4 | 2 | 6 | 20 |
sgrna.1 | gene.0 | 2 | 6 | 3 | 4 | 30 |
... | ... | ... | ... | ... | ... | ... |
sgrna.n | gene.m | 4 | 12 | 5 | 20 | 5 |
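As a rough illustration of the column handling described above, the following Python sketch picks samples out of a count table by name, regardless of column order or extra columns. It is not how crispr_screen parses its input; `select_counts` and `EXAMPLE` are hypothetical names used only here, and treating counts as integers is an assumption.

```python
import csv
import io


def select_counts(table_text, controls, treatments):
    """Return (sgrna, gene, {sample: count}) rows for the requested samples."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # The first two columns are the sgRNA and gene columns, whatever their names.
    column_index = {name: i for i, name in enumerate(header[2:], start=2)}
    wanted = list(controls) + list(treatments)
    missing = [name for name in wanted if name not in column_index]
    if missing:
        raise ValueError(f"sample labels not found in header: {missing}")
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        counts = {name: int(record[column_index[name]]) for name in wanted}
        rows.append((record[0], record[1], counts))
    return rows


# Hypothetical table with an extra column (ext_1) that is simply ignored.
EXAMPLE = (
    "sgrna\tgene\tlow_1\thigh_1\tlow_2\thigh_2\text_1\n"
    "sgrna.0\tgene.0\t1\t4\t2\t6\t20\n"
    "sgrna.1\tgene.0\t2\t6\t3\t4\t30\n"
)
rows = select_counts(EXAMPLE, ["low_1", "low_2"], ["high_1", "high_2"])
print(rows[0])
```

Looking columns up by name is what makes both the column order and any unused columns irrelevant.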
Outputs
sgRNA Results
The sgRNA results dataframe (written to <args.output>.sgrna_results.tsv) is a table whose columns are of the following form:
Column | Description |
---|---|
sgrna | The sgRNA name provided in the first column of the count_table. |
gene | The gene name provided in the second column of the count_table. |
base | The normalized/aggregated count of the sgRNA across all samples. |
control | The normalized/aggregated count of the sgRNA for the controls. |
treatment | The normalized/aggregated count of the sgRNA for the treatment. |
adj_var | The adjusted variance for the sgRNA determined by the least squares fit. |
fold_change | The fold change of the treatment over the controls. |
log2_fold_change | The log2 fold change of the treatment over the controls. |
pvalue_low | The p-value for a depletion of the sgRNA. |
pvalue_high | The p-value for an enrichment of the sgRNA. |
pvalue_twosided | The two-sided p-value of an enrichment or depletion of the sgRNA. |
fdr | The adjusted false discovery rate of the sgRNA. |
Gene Results
The gene results dataframe (written to <args.output>.gene_results.tsv) is a table whose columns are of the following form:
Column | Description |
---|---|
gene | The gene name provided in the second column of the count_table. |
fold_change | The aggregated fold change of the treatment from the controls. |
log_fold_change | The log2 aggregated fold change of the treatment from the controls. |
score_low | The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being depleted. |
pvalue_low | The aggregated p-value for a depletion of the gene. |
fdr_low | The false discovery rate for a depletion of the gene. |
score_high | The minimum p-value observed in the RRA or the U-score observed in the INC for the gene being enriched. |
pvalue_high | The aggregated p-value for an enrichment of the gene. |
fdr_high | The false discovery rate for an enrichment of the gene. |
pvalue | The minimum p-value observed with either test. |
fdr | The minimum false discovery rate observed with either test. |
phenotype_score | The -log10 FDR multiplied by the log2 fold change of the gene. |
Note: If you ran crispr_screen with INC, the FDR will be the empirical false discovery rate given the non-targeting controls. As a result, it will not be strictly monotonic with the p-values. I recommend working with the p-values in this case and paying attention to the calculated FDR thresholds given in the workflow log.
Hit Results
The hits dataframe (written to <args.output>.hits.tsv) is a table whose columns are of the following form:
Column | Description |
---|---|
gene | The gene name provided in the second column of the count_table. |
log2fc | The log2 aggregated fold change of the treatment from the controls. |
pvalue | The minimum p-value observed in the aggregation test (minimum of both sides). |
phenotype_score | The product of the log2fc and the -log10(pvalue). |
fdr | The calculated false discovery rate (only shown if running αRRA). |
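The phenotype_score definition in the table above can be written out as a small Python sketch. The function name is hypothetical, and rejecting p-values outside (0, 1] is a choice made for this illustration; how crispr_screen itself handles a p-value of zero is not stated here.

```python
import math


def phenotype_score(log2fc: float, pvalue: float) -> float:
    """Product of the log2 fold change and -log10(pvalue)."""
    if not 0.0 < pvalue <= 1.0:
        # Assumption of this sketch: refuse values where the log is undefined.
        raise ValueError("pvalue must be in (0, 1]")
    return log2fc * -math.log10(pvalue)


# A gene with a 4-fold enrichment (log2fc = 2) and a p-value of 0.01.
print(phenotype_score(2.0, 0.01))
```

The score keeps the sign of the fold change, so depleted genes get negative scores and enriched genes positive ones, with the magnitude growing as the p-value shrinks.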
Command Arguments
Subcommands
crispr_screen has two subcommands:
- test
- agg
test is used to perform the sgRNA-level differential abundance tests and then aggregate the results to the gene level.
By default test also performs the gene-level aggregation, which can be skipped with the --skip-agg flag.
agg is used to perform just the gene-level aggregation on a precalculated differential abundance matrix.
Take a look at the results.sgrna_results.tsv file to see the expected file format. Column names can be provided as well; details can be found by running crispr_screen agg --help.
Arguments
Required
The three required arguments are as follows:
Argument | Description |
---|---|
input | Filepath of the input count matrix |
controls | Labels for the control samples (can take multiple space-separated values) |
treatments | Labels for the treatment samples (can take multiple space-separated values) |
Optional Arguments
These arguments are optional and may change the configuration of the analysis.
For further details on them please run:
crispr_screen test --help
Argument | Description |
---|---|
output | Prefix of the output sgRNA and gene result dataframes |
norm | Normalization method to use |
agg | Gene aggregation method to use |
correction | Multiple hypothesis correction to use |
model-choice | Which least squares model to fit |
alpha | The alpha threshold parameter for the αRRA algorithm |
permutations | The number of permutations to perform in the αRRA algorithm |
no-adjust-alpha | Use this flag to keep alpha fixed; otherwise an empirical alpha is calculated from the provided one. |
ntc-token | The token string to search for non-targeting controls (if using INC) |
Methods
Overview
The crispr_screen pipeline is really two analyses in one.
There is first the sgRNA enrichment analysis, and second the sgRNA aggregation analysis.
The reasoning for this is that we want to determine which sgRNAs are statistically enriched or depleted in the treatments relative to the controls, based on their abundances. However, because there are generally so many sgRNAs, there will likely be many sgRNA hits that cross the significance threshold spuriously, whether for statistical or biological reasons. If we simply declared any gene with a statistically significant sgRNA to be significantly enriched or depleted, we would likely not end up with a set of genes that is robust in our assay. The aggregation step addresses this by combining the evidence from all sgRNAs targeting a gene.
sgRNA Enrichment
An overview of the sgRNA enrichment analysis is as follows:
- Normalize the read counts
- Fit the dispersion model
- Perform sgRNA-level enrichment
- Adjust p-values for multiple hypothesis correction
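The last step of the list above can be sketched in Python as follows, assuming the Benjamini-Hochberg procedure as the correction. That choice is an assumption made for illustration: which methods the `correction` option accepts is not stated in this document, and `benjamini_hochberg` is a hypothetical name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is capped
    # by the adjusted values of all larger p-values (enforces monotonicity).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


sgrna_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(sgrna_pvalues))
```

Because of the running minimum, the adjusted values are monotonic in the raw p-values.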
sgRNA Aggregation
An overview of the sgRNA aggregation procedure is as follows:
- Perform sgRNA aggregation (either αRRA or INC)
- Adjust p-values for multiple hypothesis correction
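To make the aggregation step more concrete, here is a minimal Python sketch of an αRRA-style gene score, following the general description in the MAGeCK paper: the minimum order-statistic p-value over a gene's sgRNAs, restricted to sgRNAs whose p-value falls below alpha. It is a simplified illustration and not the crispr_screen implementation. The function names and the input layout are hypothetical, and the permutation step that converts the score into a gene-level p-value is omitted.

```python
from math import comb


def order_statistic_pvalue(u: float, k: int, n: int) -> float:
    """P(the k-th smallest of n independent Uniform(0, 1) draws is <= u)."""
    # Same value as the Beta(k, n - k + 1) CDF at u, written as a binomial tail.
    return sum(comb(n, j) * u**j * (1.0 - u) ** (n - j) for j in range(k, n + 1))


def rra_score(sgrnas: list[tuple[float, float]], alpha: float) -> float:
    """Minimum order-statistic p-value for one gene.

    `sgrnas` holds (normalized_rank, pvalue) pairs for the gene's sgRNAs,
    where normalized_rank is the sgRNA's rank divided by the total number
    of sgRNAs. Ranks are assumed to follow the p-value ordering, so the
    sgRNAs with pvalue < alpha are the first ones after sorting by rank.
    """
    ranks = sorted(rank for rank, _ in sgrnas)
    n = len(ranks)
    n_selected = sum(1 for _, pvalue in sgrnas if pvalue < alpha)
    scores = [order_statistic_pvalue(ranks[k - 1], k, n) for k in range(1, n_selected + 1)]
    # With no sgRNA below alpha there is no evidence, so fall back to 1.0.
    return min(scores, default=1.0)


gene = [(0.1, 0.01), (0.2, 0.02)]
print(rra_score(gene, alpha=0.05))
```

A gene whose sgRNAs all rank near the top gets a small score, while a gene with a single strong sgRNA among otherwise unremarkable ones does not.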
Normalization
// TODO
Dispersion
// TODO
Enrichment
// TODO
αRRA
// TODO
INC
// TODO
Correction
// TODO