PCR Alignment
Now we will move onto the alignment of the enrichment PCR sequences to the sgRNA library.
We will also be using kb
for this alignment, but we first need to create
a new custom index for the sgRNA library.
sgRNA Index
We will be creating the index from the sgRNA library fasta file that we
included in the Metadata section of the tutorial: cropseq.fa
.
As a reminder this file is under: project_name/meta/pcr_index
and
has the following structure:
Filename: cropseq.fa
>rs17057051_i1
TTGTCCAGCATTCTGCTTCAATGGTTTAAGAGC
>rs17057051_i2
TTGGTCTCCATTGAAGATGTGTTGTTTAAGAGC
...
>rs1532277_i1
TTGCAGAACTCTAGCAAGACGTGGTTTAAGAGC
>rs1532277_i2
TTGGAATCTGGGCATTAGGCCCTGTTTAAGAGC
We will create our kallisto
index from these sequences with the following command:
kallisto index -k 15 -i cropseq.idx cropseq.fa
Note:
I have set the kmer size to 15 here because I have found this to be the optimal size in my own data. Feel free to adjust this value if you have strong opinions on why a different value would be best.
Project Directory
After creating this file we will also create a new directory to hold our alignment results:
mkdir -p alignment/pcr
Your project directory should now look like the following:
project_name/
└── meta/
└── whitelist/
└── 3M-february-2018.txt.gz
└── 10X_index/
├── index.idx
├── t2g.txt
├── spliced_t2c.txt
└── unspliced_t2c.txt
└── pcr_index/
├── cropseq.fa
└── cropseq.idx
└── sequence/
└── 10X/
├── sampleX_R1.fastq.gz
├── sampleX_R2.fastq.gz
: ...
├── sampleY_R1.fastq.gz
└── sampleY_R2.fastq.gz
└── pcr/
├── sampleX_R1.fastq.gz
├── sampleX_R2.fastq.gz
: ...
├── sampleY_R1.fastq.gz
└── sampleY_R2.fastq.gz
└── alignment/
├── 10X/
└── pcr/
Alignment
Now we align our PCR sequences to the newly created index using kb
similar to
the previous alignment we did for the 10X sequences.
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/pcr_index/t2g.txt \
-i meta/pcr_index/index.idx \
--filter bustools \
-o alignment/pcr/SampleX \
SampleX_R1.fastq.gz SampleX_R2.fastq.gz
Running multiple sequencing datasets as one sample
Sometimes you have the same sample split across multiple sequencing files.
kb
can handle this easily, and the only change to the above command is
to include the path of all files under the same sample name at the end.
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/pcr_index/t2g.txt \
-i meta/pcr_index/index.idx \
--filter bustools \
-o alignment/pcr/SampleX \
SampleX_1_R1.fastq.gz SampleX_1_R2.fastq.gz \
SampleX_2_R1.fastq.gz SampleX_2_R2.fastq.gz \
SampleX_3_R1.fastq.gz SampleX_3_R2.fastq.gz
Running multiple samples at once
Parallel processing of multiple samples at once is outside the scope of this tutorial, but if you are interested in efficient ways of doing this I recommend checking out:
Otherwise, you can take the above command and put it into a for-loop fairly easily:
#!/bin/bash
for sample_id in sequence/pcr/*R1.fastq.gz;
do
basename=$(basename -s ".fastq.gz" $sample_id);
r1=$sample_id;
r2=${r1/R1/R2};
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/pcr_index/t2g.txt \
-i meta/pcr_index/index.idx \
--filter bustools \
-o alignment/pcr/SampleX \
$r1 $r2;
done
Results
Once kb
has finished we can take a look at the data and see
if it is as expected.
If you explore the directory alignment/pcr/<your_sample_name>
you should see
the following:
.
├── 10xv3_whitelist.txt
├── counts_filtered
│ ├── adata.h5ad
│ ├── spliced.barcodes.txt
│ ├── spliced.genes.txt
│ ├── spliced.mtx
│ ├── unspliced.barcodes.txt
│ ├── unspliced.genes.txt
│ └── unspliced.mtx
├── counts_unfiltered
│ ├── adata.h5ad
│ ├── spliced.barcodes.txt
│ ├── spliced.genes.txt
│ ├── spliced.mtx
│ ├── unspliced.barcodes.txt
│ ├── unspliced.genes.txt
│ └── unspliced.mtx
├── filter_barcodes.txt
├── inspect.json
├── inspect.spliced.json
├── inspect.unspliced.json
├── kb_info.json
├── matrix.ec
├── output.bus
├── output.filtered.bus
├── output.unfiltered.bus
├── run_info.json
├── spliced.filtered.bus
├── spliced.unfiltered.bus
├── transcripts.txt
├── unspliced.filtered.bus
└── unspliced.unfiltered.bus
The cell x gene
matrix will be under counts_filtered/adata.h5ad
or counts_unfiltered/adata.h5ad
.
You can decide which of these you will use downstream, though I
recommend using the counts_filtered
version.