[ How Do I ... ? ]

Hello!

This is a growing collection of tutorials focused on bioinformatics, data analysis, and programming in the life sciences.

Take a look at the sidebar for the topics.

If there are things you don't know how to do and want help with, feel free to reach out and I can write more tutorials or clear up confusing information.

CRISPR Screening

So you just ran a CRISPR screen and got your sequencing files back - what next?

The analysis follows a few steps:

  1. Mapping your sequencing reads to your reference library
  2. Differential expression of guide RNA abundances
  3. Aggregation of guide differential expression into differential expression of genes
  4. Visualization and interpretation

Counting sgRNA abundance

Tools Needed

For this step of the analysis we'll need two different tools:

  1. fxtools
  2. sgcount

Each of these tools has its own documentation (fxtools and sgcount) with more detailed usage - but I will describe everything you need for this analysis here.

Installation

If you've never used any Rust tools before - you will need to install cargo, the Rust package manager.

# installs `cargo`
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then we can install these tools easily with cargo

cargo install fxtools
cargo install sgcount

You can validate that they installed with the following:

fxtools --version
sgcount --version

Reference Library

Reference library format

You will need to identify what your reference library is. In most cases this is either a CSV or a FASTA file of sgRNA names and sequences.

The expected input is a FASTA file of the sgRNA library. Here is an example of what that looks like:

>lib.0
ATAGCCCGGCGGTCTGCTGG
>lib.1
TAAGGCACTATAGCAATGAG
>lib.2
GTAGATAAAACGTGTGGCCC
>lib.3
AAGGCGACCATCTACCCTTG
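If your library comes as a CSV of sgRNA names and sequences instead, here is a minimal sketch for converting it to FASTA (assuming a two-column name,sequence file with a header line; library.csv and library.fa are placeholder names):

# convert a two-column CSV (name,sequence) into FASTA
# NR > 1 skips the header line; drop that condition if your file has no header
awk -F',' 'NR > 1 {print ">"$1"\n"$2}' library.csv > library.fa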

Processing reference library

There are a few processing steps that we will perform to validate that the reference library is in the expected format and free of errors. Then we will generate an sgRNA -> gene mapping file.

Uppercase

We will first convert the FASTA to fully uppercase:

fxtools upper -i <your_fasta.fa> -o upper.fa

Unique

We will then identify any duplicate sequences and remove them from the analysis.

fxtools unique -i upper.fa -o uniq.fa -n ambiguous.fa

sgRNA to gene mapping

If your sgRNA library is named predictably in the form <gene>_<information>... we can automatically generate the sgRNA-to-gene mapping with fxtools:

fxtools sgrna-table -i uniq.fa -o g2s.txt

Variable regions

This next step will not apply to everyone - and it depends on whether your reference library is purely the variable region of your guides or if it also contains the constant adapter sequences.

Generally, if your sequences are longer than about 23bp then your adapter sequence is still attached. You can validate this by checking whether all your sequences share the same prefix or suffix.
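A quick way to check, using standard shell tools rather than fxtools (a sketch; adjust the prefix length to taste):

# tally the most common 10bp prefixes across all sequences
grep -v "^>" uniq.fa | cut -c1-10 | sort | uniq -c | sort -rn | head

# tally the most common 10bp suffixes (reverse, take 10bp, reverse back)
grep -v "^>" uniq.fa | rev | cut -c1-10 | rev | sort | uniq -c | sort -rn | head

If a single prefix or suffix accounts for nearly every sequence, the constant adapter is still attached.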

If so - we will need to remove those constant regions and just operate with the variable region.

We can use fxtools for this:

fxtools extract-variable -i uniq.fa -o uniq.var.fa

Guide Counting

Now that we have prepared our reference library we are ready to map our records with sgcount.

Running sgcount

sgcount \
  -l uniq.fa \
  -i sample_a.fq.gz sample_b.fq.gz sample_c.fq.gz sample_d.fq.gz \
  -n a b c d \
  -t 4 \
  -g g2s.txt \
  -o mapping.tab

In the above command we are:

  • providing our reference library (uniq.fa or uniq.var.fa, generated previously) to the -l flag,
  • providing the fastq files for our sequencing samples to the -i flag,
  • providing an optional custom renaming of those samples with the -n flag,
  • stating we will be using 4 threads with the -t flag,
  • providing our sgRNA-to-gene mapping (g2s.txt) with the -g flag,
  • and writing our results to the output file mapping.tab with the -o flag.

That's it! Once it finishes we're done with this step.
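Before moving on, it's worth a quick peek at the counts table to confirm the guide names and sample columns look as expected (the exact column names will depend on your library and the -n names you passed):

# preview the first few rows of the count matrix
head -n 5 mapping.tab | column -t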

More information

For more detail on the optional arguments, check out the sgcount documentation; for details on how it works under the hood, see the sgcount implementation.

Differential Expression

Now that we have finished mapping our reads we are ready to perform differential expression.

The following step applies to simple or 1-dimensional screens, whose samples are generally of the form:

sample_low_1
sample_low_2
sample_high_1
sample_high_2

Note: For more complicated schemas I recommend using DESeq2 for sgRNA-level differential expression and then aggregating gene-level information with sgagg.

Installation

For this analysis we will need crispr_screen. We can install it easily with the Rust package manager, cargo:

cargo install crispr_screen

Usage

To run our analysis we just use the following command:

crispr_screen test \
  -i mapping.tab \
  -c sample_low_1 sample_low_2 \
  -t sample_high_1 sample_high_2

This will create two output files, results.gene_results.tab and results.sgrna_results.tab, which reflect the differential expression at the gene and sgRNA level respectively.

More Information

This will perform the default enrichment analysis - but for more details or alternative analysis methods check out the crispr_screen documentation.

Visualization

Now we come to the best part - looking at your results!

If you're proficient in Python or R this step can be skipped and you can roll your own visualizations from the results.gene_results.tab and results.sgrna_results.tab tables.

But if you're looking for a quick visualization toolkit you can use my tool screenviz.

Installation

You can install it straight from GitHub with the Python package manager, pip:

pip install git+https://github.com/noamteyssier/screenviz

Usage

Gene Analysis

To visualize your gene-level results you can use the screenviz gene subcommand:

screenviz gene -i results.gene_results.tab

This will write the visualization to a file gene_volcano.html which you can open in your favorite browser to interact with:

firefox gene_volcano.html

sgRNA Analysis

To visualize your sgrna-level results you can use the screenviz sgrna subcommand:

screenviz sgrna -i results.sgrna_results.tab

This will write the visualization to a file sgrna_volcano.html which you can open in your favorite browser to interact with:

firefox sgrna_volcano.html

CROP-Seq

Introduction

This section is for those starting their analysis of CROP-seq datasets. It focuses on taking your sequencing data to numerical data, and shows you how to install what you need, align your reads, and assign each cell carrying an sgRNA to its respective guide.

Overview

A brief overview of what you will do:

  1. Prepare your environment for the analysis

    • Installation
    • Directory Management
  2. Sequencing Alignment

    • Aligning to the transcriptome
    • Aligning to the sgRNA library
  3. sgRNA Assignment

    • Statistically assigning cells to sgRNAs
    • Intersecting 10X and sgRNA library into a single dataset.

What to do next

From there you will be able to use other tutorials written elsewhere for the single-cell analysis.

Installation

There are many tools that you will need for this analysis. These tools require a basic understanding of package management and tool usage at the command line, in Python, and in R.

There are many resources out there for how to install specific packages so I will not spend too much time on the details of installation.

Mandatory Note

I highly recommend creating a new conda environment dedicated to this analysis alone.

As you generate new datasets you can copy this environment over to them and modify the copies, but leave this one unchanged so that you can always come back to the same analysis without any unexpected quirks.

Commandline Tools

  • conda
    • Used for environment management
  • kb-python (kallisto bustools)
    • Used for sequence alignment to the transcriptome
  • kallisto
    • is the core tool for pseudoalignment and pseudoalignment reference building
    • a version of this will be packaged into kallisto-bustools but if you don't want to search for the binary you can just install this separately
  • bustools
    • a tool for working with *.bus files, which are used to organize and interact with single-cell alignment results.
    • Some specifications about the BUS File Format
    • a version of this will be packaged into kallisto-bustools but if you don't want to search for the binary you can just install this separately
  • jupyter
    • Used for generating an interactive environment to explore your data (overview)

Python Modules

  • scanpy
    • Used for single-cell analysis
  • scvelo
    • Used for single-cell RNA velocity analysis (Optional)
  • geomux
    • used for cell-guide assignment
  • hgsig
    • used for cluster enrichment/depletion and significance testing

R Libraries

  • edgeR
    • used for differential expression analysis

Environment

As mentioned previously, we will create our conda environment now and add in our modules as we need them.

Installing Conda

I recommend you install Miniconda, which is a minimal installer for conda. It has a very small footprint and works great.

See the installation instructions here.

Follow the above instructions and check if everything works:

conda --version

This should give you a version number of your installation.

Create Conda Environment

Now we are ready to create our conda environment for our analysis.

Make sure you only use this environment for a single analysis. Do not reuse environments between analyses, since this may lead to versioning issues and your analysis may not be reproducible when you come back to it several months later!

conda create -n my_cropseq python=3.8

Here we are creating an environment called my_cropseq and pinning the Python version to 3.8.

You can use a different Python version if you want the latest and greatest features, but I have found this version plays best with older modules.

Moving into and out of the environment

We can now step into our environment as follows:

conda activate my_cropseq

And we can step out as follows:

conda deactivate

Background Processing

This is a small note to mention that many of the upcoming analysis steps may take a long time to run, and you may not be at your computer for their entire duration.

Most often you will run this on a remote computer, so I wanted to provide a few links and a quick intro to remote screens (terminal multiplexers). This software is optional but very useful, so that you are not kicked off remote servers when you close your computer or your connection goes idle.

Some nice options for screens are:

  1. zellij
  2. tmux
  3. screen

There are tutorials elsewhere for how to use these, but again I just wanted to mention them here so that you don't have to restart any analyses.
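As a minimal sketch with tmux (the other two follow the same pattern):

# start a named session and launch your long-running command inside it
tmux new -s cropseq

# detach from the session while it keeps running: press Ctrl-b, then d

# later, reattach to check on progress
tmux attach -t cropseq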

Sequence Processing

The first step of the analysis is turning the sequence data into numerical data. This section of the tutorial is focused on those initial steps.

There will first be two sequencing alignment steps:

  1. Aligning the 10X sequences to the transcriptome
  2. Aligning the PCR sequences to the sgRNA sequences

Then there will be an assignment step to link the 10X reads to their corresponding sgRNA.

Directory Management

Creating a project directory

The first thing we will do is make a project directory to organize the individual blocks of our analysis.

Let's first create a directory skeleton which we will then populate with some data we need to collect from other sources.

# create your root directory as well as a meta and sequence directory
mkdir -p project_name/{meta,sequence}

# create a 10X and a PCR directory within your sequence directory
mkdir -p project_name/sequence/{10X,pcr}

# create the meta directories we will populate later
mkdir -p project_name/meta/{10X_index,pcr_index,whitelist}

You should now have a directory structure that looks like this:

project_name/
└── meta/
    ├── 10X_index/
    ├── pcr_index/
    └── whitelist/
└── sequence/
    ├── 10X/
    └── pcr/

At this point you will want to move your 10X files to the project_name/sequence/10X directory and your enrichment PCR files to the project_name/sequence/pcr directory so that you have the following directory structure:

project_name/
└── meta/
    ├── 10X_index/
    ├── pcr_index/
    └── whitelist/
└── sequence/
    └── 10X/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz
    └── pcr/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz

Metadata

Whitelist

The whitelist is a set of cell barcodes you expect your reads to come from. This is usually specified by the technology and can be found online.

Here you can find the 10Xv3 cell barcodes in my github mirror. But you can also find them in your installation of CellRanger or kallisto.

Note:

You will use the same whitelist for both the 10X sequences and the PCR sequences as they come from the same set of cells. This step is not strictly required if you are aligning with kallisto, because it can provide you with the same whitelist. But providing this file yourself will save space, as otherwise the same whitelist is written to disk for each sample you process.

# Download the 10X v3 Whitelist from my github mirror
wget \
  -O project_name/meta/whitelist/3M-february-2018.txt.gz \
  https://github.com/noamteyssier/10x_whitelist_mirror/raw/main/3M-february-2018.txt.gz
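A quick sanity check that the download worked (10X v3 cell barcodes are 16bp, and the list contains a few million of them):

# peek at the first few barcodes
zcat project_name/meta/whitelist/3M-february-2018.txt.gz | head -n 3

# count the total number of barcodes
zcat project_name/meta/whitelist/3M-february-2018.txt.gz | wc -l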

Kallisto Index

The index is a k-mer de Bruijn graph representation of the transcriptome sequences you will be aligning your 10X sequences against.

Note:

If you are interested in learning more about the kallisto index I highly recommend reading the kallisto paper which describes the pseudo-alignment algorithm and the index. {{#cite bray_near-optimal_2016}}

If you are interested in building your own reference index please check out the kb tutorial

For the purposes of this tutorial we will be using a prebuilt index generated for RNA velocity analyses.

We will also be building our own index in an upcoming section so you can look at how to do that there as well.

RNA Velocity Transcriptome

Even if you are not interested in doing an RNA velocity analysis - it doesn't hurt to align to a spliced/unspliced reference.

The counts are aggregated across the spliced/unspliced records in the downstream data, but you gain the added option of inspecting the spliced/unspliced ratio.

# move into the planned directory
cd project_name/meta/10X_index/

# Downloading the Reference Index
kb ref \
  -d linnarsson \
  -i index.idx \
  -g t2g.txt \
  -c1 spliced_t2c.txt \
  -c2 unspliced_t2c.txt

The above command will download the reference index and create the following files:

  1. project_name/meta/10X_index/index.idx
  2. project_name/meta/10X_index/t2g.txt
  3. project_name/meta/10X_index/spliced_t2c.txt
  4. project_name/meta/10X_index/unspliced_t2c.txt

These will be used in the next section when we run our 10X alignment.

Target Sequences

The cropseq.fa file here is a stand-in for the FASTA-formatted sequences of your expected guides.

You will want to place this under project_name/meta/pcr_index/

Filename: cropseq.fa

>rs17057051_i1
TTGTCCAGCATTCTGCTTCAATGGTTTAAGAGC
>rs17057051_i2
TTGGTCTCCATTGAAGATGTGTTGTTTAAGAGC
              ...
>rs1532277_i1
TTGCAGAACTCTAGCAAGACGTGGTTTAAGAGC
>rs1532277_i2
TTGGAATCTGGGCATTAGGCCCTGTTTAAGAGC
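It's worth confirming that the number of records in this file matches the number of guides in your library design:

# count the number of guide records in the fasta
grep -c "^>" project_name/meta/pcr_index/cropseq.fa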

Final Directory Structure

I highly recommend you have a directory structure like the following for your project.

Downstream code will expect this directory structure, but if you feel strongly about your own layout it can easily be adapted to your own interests.

project_name/
└── meta/
    └── whitelist/
        └── 3M-february-2018.txt.gz
    └── 10X_index/
        ├── index.idx 
        ├── t2g.txt
        ├── spliced_t2c.txt
        └── unspliced_t2c.txt
    └── pcr_index/
        └── cropseq.fa
└── sequence/
    └── 10X/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz
    └── pcr/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz

10X Alignment

This section will cover the sequence alignment steps for the 10X sequences. It assumes you have a similar directory structure to the one described in the previous sections.

Installation

We are going to use kb-python for sequencing alignment, which is a wrapper around the kallisto and bustools command-line tools.

# enter our conda environment (can skip if already in environment)
conda activate my_cropseq

# install kb-python via pip
pip install kb-python

# check if installation worked
kb --version

Creating your results directory

kb will create an output directory for each of the samples. This quickly gets unwieldy, so I recommend creating a new directory in your project to hold these data for easy access later.

From the root of your project directory run:

mkdir -p alignment/10X

Your project directory should now look like the following:

project_name/
└── meta/
    └── whitelist/
        └── 3M-february-2018.txt.gz
    └── 10X_index/
        ├── index.idx 
        ├── t2g.txt
        ├── spliced_t2c.txt
        └── unspliced_t2c.txt
    └── pcr_index/
└── sequence/
    └── 10X/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz
    └── pcr/
└── alignment/
    └── 10X/

Running kb count

We can now run the alignment step which will create the cell x gene count matrix.

We will do this using the kb count submodule of kb.

kb count \
  --verbose \
  --h5ad \
  -t 4 \
  -x 10xv3 \
  -w meta/whitelist/3M-february-2018.txt.gz \
  -g meta/10X_index/t2g.txt \
  -i meta/10X_index/index.idx \
  --workflow lamanno \
  --filter bustools \
  -c1 meta/10X_index/spliced_t2c.txt \
  -c2 meta/10X_index/unspliced_t2c.txt \
  -o alignment/10X/SampleX \
  sequence/10X/SampleX_R1.fastq.gz sequence/10X/SampleX_R2.fastq.gz

Running multiple sequencing datasets as one sample

Sometimes you have the same sample split across multiple sequencing files. kb can handle this easily, and the only change to the above command is to include the path of all files under the same sample name at the end.

kb count \
  --verbose \
  --h5ad \
  -t 4 \
  -x 10xv3 \
  -w meta/whitelist/3M-february-2018.txt.gz \
  -g meta/10X_index/t2g.txt \
  -i meta/10X_index/index.idx \
  --workflow lamanno \
  --filter bustools \
  -c1 meta/10X_index/spliced_t2c.txt \
  -c2 meta/10X_index/unspliced_t2c.txt \
  -o alignment/10X/SampleX \
  sequence/10X/SampleX_1_R1.fastq.gz sequence/10X/SampleX_1_R2.fastq.gz \
  sequence/10X/SampleX_2_R1.fastq.gz sequence/10X/SampleX_2_R2.fastq.gz \
  sequence/10X/SampleX_3_R1.fastq.gz sequence/10X/SampleX_3_R2.fastq.gz

Running multiple samples at once

Parallel processing of multiple samples at once is outside the scope of this tutorial, but you can take the above command and put it into a for-loop fairly easily:

#!/bin/bash

for sample_id in sequence/10X/*R1.fastq.gz;
do
    # strip the "_R1.fastq.gz" suffix to recover the sample name
    basename=$(basename -s "_R1.fastq.gz" $sample_id);
    r1=$sample_id;
    r2=${r1/R1/R2};

    kb count \
      --verbose \
      --h5ad \
      -t 4 \
      -x 10xv3 \
      -w meta/whitelist/3M-february-2018.txt.gz \
      -g meta/10X_index/t2g.txt \
      -i meta/10X_index/index.idx \
      --workflow lamanno \
      --filter bustools \
      -c1 meta/10X_index/spliced_t2c.txt \
      -c2 meta/10X_index/unspliced_t2c.txt \
      -o alignment/10X/${basename} \
      $r1 $r2
    
done

Results

Once kb has finished we can take a look at the data and see if it is as expected.

If you explore the directory alignment/10X/<your_sample_name> you should see the following:

.
├── 10xv3_whitelist.txt
├── counts_filtered
│   ├── adata.h5ad
│   ├── spliced.barcodes.txt
│   ├── spliced.genes.txt
│   ├── spliced.mtx
│   ├── unspliced.barcodes.txt
│   ├── unspliced.genes.txt
│   └── unspliced.mtx
├── counts_unfiltered
│   ├── adata.h5ad
│   ├── spliced.barcodes.txt
│   ├── spliced.genes.txt
│   ├── spliced.mtx
│   ├── unspliced.barcodes.txt
│   ├── unspliced.genes.txt
│   └── unspliced.mtx
├── filter_barcodes.txt
├── inspect.json
├── inspect.spliced.json
├── inspect.unspliced.json
├── kb_info.json
├── matrix.ec
├── output.bus
├── output.filtered.bus
├── output.unfiltered.bus
├── run_info.json
├── spliced.filtered.bus
├── spliced.unfiltered.bus
├── transcripts.txt
├── unspliced.filtered.bus
└── unspliced.unfiltered.bus

The cell x gene matrix will be under counts_filtered/adata.h5ad or counts_unfiltered/adata.h5ad.

You can decide which of these you will use downstream, though I recommend using the counts_filtered version.
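Before moving on it is also worth glancing at the pseudoalignment statistics that kallisto writes alongside the matrices, in particular the fraction of reads that pseudoaligned:

# summary statistics for the run (reads processed, % pseudoaligned, etc.)
cat alignment/10X/SampleX/run_info.json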

PCR Alignment

Now we will move on to the alignment of the enrichment PCR sequences to the sgRNA library.

We will also be using kb for this alignment, but we first need to create a new custom index for the sgRNA library.

sgRNA Index

We will be creating the index from the sgRNA library fasta file that we included in the Metadata section of the tutorial: cropseq.fa.

As a reminder this file is under: project_name/meta/pcr_index and has the following structure:

Filename: cropseq.fa

>rs17057051_i1
TTGTCCAGCATTCTGCTTCAATGGTTTAAGAGC
>rs17057051_i2
TTGGTCTCCATTGAAGATGTGTTGTTTAAGAGC
              ...
>rs1532277_i1
TTGCAGAACTCTAGCAAGACGTGGTTTAAGAGC
>rs1532277_i2
TTGGAATCTGGGCATTAGGCCCTGTTTAAGAGC

We will create our kallisto index from these sequences with the following command:

kallisto index -k 15 -i cropseq.idx cropseq.fa

Note:

I have set the k-mer size to 15 here because I have found this to be the optimal size in my own data. Feel free to adjust this value if you have strong opinions on why a different value would be best.
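One more piece of metadata: the kb count command in the next section also expects a transcript-to-gene mapping for this index (the -g flag). Since each guide effectively acts as its own gene here, a minimal sketch (assuming the FASTA headers contain only the guide names) is to map each guide name to itself:

# run from inside project_name/meta/pcr_index/ (where cropseq.fa lives)
# build a two-column mapping of each guide name to itself
grep "^>" cropseq.fa \
  | sed 's/^>//' \
  | awk -v OFS='\t' '{print $1, $1}' \
  > t2g.txt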

Project Directory

After creating this file we will also create a new directory to hold our alignment results:

mkdir -p alignment/pcr

Your project directory should now look like the following:

project_name/
└── meta/
    └── whitelist/
        └── 3M-february-2018.txt.gz
    └── 10X_index/
        ├── index.idx 
        ├── t2g.txt
        ├── spliced_t2c.txt
        └── unspliced_t2c.txt
    └── pcr_index/
        ├── cropseq.fa
        ├── cropseq.idx
        └── t2g.txt
└── sequence/
    └── 10X/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz
    └── pcr/
        ├── sampleX_R1.fastq.gz
        ├── sampleX_R2.fastq.gz
        :        ...
        ├── sampleY_R1.fastq.gz
        └── sampleY_R2.fastq.gz
└── alignment/
    ├── 10X/
    └── pcr/

Alignment

Now we align our PCR sequences to the newly created index using kb similar to the previous alignment we did for the 10X sequences.

kb count \
  --verbose \
  --h5ad \
  -t 4 \
  -x 10xv3 \
  -w meta/whitelist/3M-february-2018.txt.gz \
  -g meta/pcr_index/t2g.txt \
  -i meta/pcr_index/cropseq.idx \
  --filter bustools \
  -o alignment/pcr/SampleX \
  sequence/pcr/SampleX_R1.fastq.gz sequence/pcr/SampleX_R2.fastq.gz

Running multiple sequencing datasets as one sample

Sometimes you have the same sample split across multiple sequencing files. kb can handle this easily, and the only change to the above command is to include the path of all files under the same sample name at the end.

kb count \
  --verbose \
  --h5ad \
  -t 4 \
  -x 10xv3 \
  -w meta/whitelist/3M-february-2018.txt.gz \
  -g meta/pcr_index/t2g.txt \
  -i meta/pcr_index/cropseq.idx \
  --filter bustools \
  -o alignment/pcr/SampleX \
  sequence/pcr/SampleX_1_R1.fastq.gz sequence/pcr/SampleX_1_R2.fastq.gz \
  sequence/pcr/SampleX_2_R1.fastq.gz sequence/pcr/SampleX_2_R2.fastq.gz \
  sequence/pcr/SampleX_3_R1.fastq.gz sequence/pcr/SampleX_3_R2.fastq.gz

Running multiple samples at once

Parallel processing of multiple samples at once is outside the scope of this tutorial, but you can take the above command and put it into a for-loop fairly easily:

#!/bin/bash

for sample_id in sequence/pcr/*R1.fastq.gz;
do
    # strip the "_R1.fastq.gz" suffix to recover the sample name
    basename=$(basename -s "_R1.fastq.gz" $sample_id);
    r1=$sample_id;
    r2=${r1/R1/R2};

    kb count \
      --verbose \
      --h5ad \
      -t 4 \
      -x 10xv3 \
      -w meta/whitelist/3M-february-2018.txt.gz \
      -g meta/pcr_index/t2g.txt \
      -i meta/pcr_index/cropseq.idx \
      --filter bustools \
      -o alignment/pcr/${basename} \
      $r1 $r2;
    
done

Results

Once kb has finished we can take a look at the data and see if it is as expected.

If you explore the directory alignment/pcr/<your_sample_name> you should see the following:

.
├── 10xv3_whitelist.txt
├── counts_filtered
│   ├── adata.h5ad
│   ├── spliced.barcodes.txt
│   ├── spliced.genes.txt
│   ├── spliced.mtx
│   ├── unspliced.barcodes.txt
│   ├── unspliced.genes.txt
│   └── unspliced.mtx
├── counts_unfiltered
│   ├── adata.h5ad
│   ├── spliced.barcodes.txt
│   ├── spliced.genes.txt
│   ├── spliced.mtx
│   ├── unspliced.barcodes.txt
│   ├── unspliced.genes.txt
│   └── unspliced.mtx
├── filter_barcodes.txt
├── inspect.json
├── inspect.spliced.json
├── inspect.unspliced.json
├── kb_info.json
├── matrix.ec
├── output.bus
├── output.filtered.bus
├── output.unfiltered.bus
├── run_info.json
├── spliced.filtered.bus
├── spliced.unfiltered.bus
├── transcripts.txt
├── unspliced.filtered.bus
└── unspliced.unfiltered.bus

The cell x guide matrix will be under counts_filtered/adata.h5ad or counts_unfiltered/adata.h5ad.

You can decide which of these you will use downstream, though I recommend using the counts_filtered version.
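A useful sanity check here is how many cell barcodes passed the bustools filter for the PCR library; this should be on the same order as the number of cells recovered in the 10X alignment:

# number of cell barcodes retained by the bustools filter
wc -l alignment/pcr/SampleX/filter_barcodes.txt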

Assignment

Now that we have the PCR sequences aligned to their respective guides we can perform the assignment step.

Summary

The assignment step will be done with geomux.

This is a tool I created based on a hypergeometric test which can assign cells to guides with high confidence.

The intuition behind the mathematics is that every cell with an sgRNA will have a background distribution (i.e. sequencing noise) and a signal distribution (i.e. its true sgRNAs). The hypergeometric test is used to identify which sgRNA abundances occur above what is expected by chance, and then a ratio test is used to select cells with only a single signal sgRNA (i.e. to filter against a high multiplicity of infection).

Installation

We can install geomux directly from its github repo.

pip install git+https://github.com/noamteyssier/geomux

Usage

We can use geomux directly from the command line, or we can use it from a jupyter notebook.

I prefer to run it from the commandline and will include those instructions here.

If you prefer to use it from a jupyter notebook, check out the tutorial on the repo.

Creating an assignment directory

First we create a directory to hold our assignment results.

mkdir -p meta/assignment/

Running geomux

We can now run geomux with the following command:

geomux \
    -i alignment/pcr/sampleX/counts_filtered/adata.h5ad \
    -o meta/assignment/sampleX.tab

This will create a tab-separated file (*.tab) containing the cell barcode, the assigned guide, and statistics about the assignment.
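You can take a quick look at the assignments from the command line (the exact column names will depend on your geomux version):

# preview the first few cell-to-guide assignments
head -n 5 meta/assignment/sampleX.tab | column -t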

Intersecting the 10X sequencing data and the Assignments

We have now done all the steps necessary to intersect the 10X single-cell sequencing data with the enrichment PCR.

We can now just load in the 10X sequencing data and merge the dataframes together to have our cells assigned to their respective sgRNAs.

The next step will assume you have scanpy installed already.

import pandas as pd
import scanpy as sc

# load assignments
assignments = pd.read_csv(
    "meta/assignment/sampleX.tab", 
    sep="\t", 
    index_col="barcode"
)

# load anndata (10X)
adata = sc.read(
    "alignment/10x/sampleX/counts_filtered/adata.h5ad"
)

# merge the two dataframes on the cell barcode index
adata.obs = adata.obs.merge(
    assignments, 
    left_index=True, 
    right_index=True, 
    how="left"
)

# drop cells that do not have an assignment
adata = adata[~adata.obs.guide.isna()].copy()

# view the merged anndata
adata