Expected Formats

Inputs

Count Table

In the above example, the count_table.tsv is a tab-separated file of the following form:

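| sgrna | gene | low_1 | high_1 | low_2 | high_2 |
|---|---|---|---|---|---|
| sgrna.0 | gene.0 | 1 | 4 | 2 | 6 |
| sgrna.1 | gene.0 | 2 | 6 | 3 | 4 |
| ... | ... | ... | ... | ... | ... |
| sgrna.n | gene.m | 4 | 12 | 5 | 20 |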

Note:

The sample names you provide do not need to be in the same order as they appear in the file.

However, the first two columns do need to be an sgRNA and gene column respectively (though they can be named whatever you like.)

If you have extra columns in the table you don't want to analyze, just provide the names of the columns you do want to analyze.

The above command will still work as expected for the following table:

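| sgrna | gene | low_1 | high_1 | low_2 | high_2 | ext_1 |
|---|---|---|---|---|---|---|
| sgrna.0 | gene.0 | 1 | 4 | 2 | 6 | 20 |
| sgrna.1 | gene.0 | 2 | 6 | 3 | 4 | 30 |
| ... | ... | ... | ... | ... | ... | ... |
| sgrna.n | gene.m | 4 | 12 | 5 | 20 | 5 |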
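The column rules above (first two columns are sgRNA and gene, samples are matched by name, extra columns are ignored) can be sketched in Python. This is a minimal illustration using only the standard library, not crispr_screen's own parser; the function name `select_samples` and the inline table are hypothetical.

```python
import csv
import io

def select_samples(tsv_text, samples):
    """Return (sgrna, gene, counts) rows, keeping only the requested sample columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # The first two columns are treated as sgRNA and gene, whatever they are named.
    sample_columns = header[2:]
    missing = [s for s in samples if s not in sample_columns]
    if missing:
        raise ValueError(f"samples not found in count table: {missing}")
    # Look up each requested sample by name, so the order need not match the file.
    indices = [2 + sample_columns.index(s) for s in samples]
    rows = []
    for record in reader:
        counts = [int(record[i]) for i in indices]
        rows.append((record[0], record[1], counts))
    return rows

# Illustrative table with an extra column (ext_1) that is not requested.
TABLE = (
    "sgrna\tgene\tlow_1\thigh_1\tlow_2\thigh_2\text_1\n"
    "sgrna.0\tgene.0\t1\t4\t2\t6\t20\n"
    "sgrna.1\tgene.0\t2\t6\t3\t4\t30\n"
)

# Samples are requested in a different order than they appear in the file.
rows = select_samples(TABLE, ["low_1", "low_2", "high_1", "high_2"])
```

Because columns are resolved by name, `ext_1` is simply never read, and the returned counts follow the order in which the samples were requested.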

Outputs

sgRNA Results

The sgRNA results dataframe (written to <args.output>.sgrna_results.tsv) is a table whose columns are of the following form:

| Column | Description |
|---|---|
| sgrna | The sgRNA name provided in the first column of the count_table. |
| gene | The gene name provided in the second column of the count_table. |
| base | The normalized/aggregated count of the sgRNA across all samples. |
| control | The normalized/aggregated count of the sgRNA for the controls. |
| treatment | The normalized/aggregated count of the sgRNA for the treatment. |
| adj_var | The adjusted variance for the sgRNA determined by the least squares fit. |
| fold_change | The fold change of the treatment over the controls. |
| log2_fold_change | The log2 fold change of the treatment over the controls. |
| pvalue_low | The p-value for a depletion of the sgRNA. |
| pvalue_high | The p-value for an enrichment of the sgRNA. |
| pvalue_twosided | The two-sided p-value for an enrichment or depletion of the sgRNA. |
| fdr | The adjusted false discovery rate of the sgRNA. |
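As a rough sketch of how the `fold_change` and `log2_fold_change` columns relate to the `control` and `treatment` columns, assuming the fold change is the plain ratio of treatment over control: the function name `fold_changes` is hypothetical, and how crispr_screen itself handles zero counts is not covered here.

```python
import math

def fold_changes(control, treatment):
    """Return (fold_change, log2_fold_change) of treatment over control."""
    # Assumption: both aggregated counts are strictly positive.
    if control <= 0 or treatment <= 0:
        raise ValueError("counts must be positive to take a ratio and a logarithm")
    fold_change = treatment / control
    return fold_change, math.log2(fold_change)

# Made-up counts: the treatment is four times the control.
fc, lfc = fold_changes(control=50.0, treatment=200.0)
```

A positive `log2_fold_change` indicates enrichment in the treatment and a negative value indicates depletion.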

Gene Results

The gene results dataframe (written to <args.output>.gene_results.tsv) is a table whose columns are of the following form:

| Column | Description |
|---|---|
| gene | The gene name provided in the second column of the count_table. |
| fold_change | The aggregated fold change of the treatment over the controls. |
| log_fold_change | The log2 aggregated fold change of the treatment over the controls. |
| score_low | The minimum p-value observed in the RRA, or the U-score observed in the INC, for the gene being depleted. |
| pvalue_low | The aggregated p-value for a depletion of the gene. |
| fdr_low | The false discovery rate for a depletion of the gene. |
| score_high | The minimum p-value observed in the RRA, or the U-score observed in the INC, for the gene being enriched. |
| pvalue_high | The aggregated p-value for an enrichment of the gene. |
| fdr_high | The false discovery rate for an enrichment of the gene. |
| pvalue | The minimum p-value observed with either test. |
| fdr | The minimum false discovery rate observed with either test. |
| phenotype_score | The -log10 FDR multiplied by the log2 fold change of the gene. |
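The `phenotype_score` description above can be written out as a small calculation. This is a sketch of the stated formula only; the function name `phenotype_score` and the example values are hypothetical, and the handling of an FDR of exactly zero is an assumption.

```python
import math

def phenotype_score(fdr, log2_fold_change):
    """Return -log10(fdr) multiplied by the log2 fold change."""
    # Assumption: an FDR of zero is rejected rather than clipped.
    if fdr <= 0:
        raise ValueError("fdr must be positive to take a logarithm")
    return -math.log10(fdr) * log2_fold_change

# Made-up gene: depleted (negative log2 fold change) with an FDR of 0.01.
score = phenotype_score(fdr=0.01, log2_fold_change=-1.5)
```

The sign of the score follows the direction of the fold change, and its magnitude grows as the FDR shrinks.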

Note: If you ran crispr_screen with INC

The FDR in this case will be the empirical false discovery rate given the non-targeting controls. As a result, it will not be strictly monotonic with the p-values. I recommend working with the p-values in this case and paying attention to the calculated FDR thresholds given in the workflow log.

Hit Results

The hits dataframe (written to <args.output>.hits.tsv) is a table whose columns are of the following form:

| Column | Description |
|---|---|
| gene | The gene name provided in the second column of the count_table. |
| log2fc | The log2 aggregated fold change of the treatment from the controls. |
| pvalue | The minimum p-value observed in the aggregation test (minimum of both sides). |
| phenotype_score | The product of the log2fc and the -log10(pvalue). |
| fdr | The calculated false discovery rate (only shown if running $\alpha$-RRA). |