10X Alignment
This section will cover the sequence alignment steps for the 10X sequences. It assumes you have a similar directory structure to the one described in the previous sections.
Installation
We are going to use kb-python
for sequencing alignment, which is a wrapper
of the kallisto
and bustools
commandline tools.
# enter our conda environment (can skip if already in environment)
conda activate my_cropseq
# install kb-python via pip
pip install kb-python
# check if installation worked
kb --version
Creating your results directory
kb
will create an output directory for each of the samples.
This quickly gets unwieldy, so I recommend creating a new directory
in your project to hold these data for easy access later.
From the root of your project directory run:
mkdir -p alignment/10X
Your project directory should now look like the following:
project_name/
└── meta/
└── whitelist/
└── 3M-february-2018.txt.gz
└── 10X_index/
├── index.idx
├── t2g.txt
├── spliced_t2c.txt
└── unspliced_t2c.txt
└── pcr_index/
└── sequence/
└── 10X/
├── sampleX_R1.fastq.gz
├── sampleX_R2.fastq.gz
: ...
├── sampleY_R1.fastq.gz
└── sampleY_R2.fastq.gz
└── pcr/
└── alignment/
└── 10X/
Running kb count
We can now run the alignment step which will create the cell x gene
count matrix.
We will do this using the kb count
submodule of kb
.
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/10X_index/t2g.txt \
-i meta/10X_index/index.idx \
--workflow lamanno \
--filter bustools \
-c1 meta/10X_index/spliced_t2c.txt \
-c2 meta/10X_index/unspliced_t2c.txt \
-o alignment/10X/SampleX \
SampleX_R1.fastq.gz SampleX_R2.fastq.gz
Running multiple sequencing datasets as one sample
Sometimes you have the same sample split across multiple sequencing files.
kb
can handle this easily, and the only change to the above command is to
include the path of all files under the same sample name at the end.
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/10X_index/t2g.txt \
-i meta/10X_index/index.idx \
--workflow lamanno \
--filter bustools \
-c1 meta/10X_index/spliced_t2c.txt \
-c2 meta/10X_index/unspliced_t2c.txt \
-o alignment/10X/SampleX \
SampleX_1_R1.fastq.gz SampleX_1_R2.fastq.gz \
SampleX_2_R1.fastq.gz SampleX_2_R2.fastq.gz \
SampleX_3_R1.fastq.gz SampleX_3_R2.fastq.gz
Running multiple samples at once
Parallel processing of multiple samples at once is outside the scope of this tutorial, but if you are interested in efficient ways of doing this I recommend checking out:
Otherwise, you can take the above command and put it into a for-loop fairly easily:
#!/bin/bash
for sample_id in sequence/10X/*R1.fastq.gz;
do
basename=$(basename -s ".fastq.gz" $sample_id);
r1=$sample_id;
r2=${r1/R1/R2};
kb count \
--verbose \
--h5ad \
-t 4 \
-x 10xv3 \
-w meta/whitelist/3M-february-2018.txt.gz \
-g meta/10X_index/t2g.txt \
-i meta/10X_index/index.idx \
--workflow lamanno \
--filter bustools \
-c1 meta/10X_index/spliced_t2c.txt \
-c2 meta/10X_index/unspliced_t2c.txt \
-o alignment/10X/${basename} \
$r1 $r2
done
Results
Once kb
has finished we can take a look at the data and see
if it is as expected.
If you explore the directory alignment/10X/<your_sample_name>
you should see
the following:
.
├── 10xv3_whitelist.txt
├── counts_filtered
│ ├── adata.h5ad
│ ├── spliced.barcodes.txt
│ ├── spliced.genes.txt
│ ├── spliced.mtx
│ ├── unspliced.barcodes.txt
│ ├── unspliced.genes.txt
│ └── unspliced.mtx
├── counts_unfiltered
│ ├── adata.h5ad
│ ├── spliced.barcodes.txt
│ ├── spliced.genes.txt
│ ├── spliced.mtx
│ ├── unspliced.barcodes.txt
│ ├── unspliced.genes.txt
│ └── unspliced.mtx
├── filter_barcodes.txt
├── inspect.json
├── inspect.spliced.json
├── inspect.unspliced.json
├── kb_info.json
├── matrix.ec
├── output.bus
├── output.filtered.bus
├── output.unfiltered.bus
├── run_info.json
├── spliced.filtered.bus
├── spliced.unfiltered.bus
├── transcripts.txt
├── unspliced.filtered.bus
└── unspliced.unfiltered.bus
The cell x gene
matrix will be under counts_filtered/adata.h5ad
or counts_unfiltered/adata.h5ad
.
You can decide which of these you will use downstream, though I
recommend using the counts_filtered
version.