Introducing xsra - a faster fasterq-dump with built-in compression and BINSEQ output

Background§

The Sequence Read Archive and fasterq-dump§

The Sequence Read Archive (SRA) is the largest publicly available repository of high throughput sequencing data - but the *.sra file format is complicated and impractical to read directly.

To get usable reads you’re effectively stuck with fasterq-dump, the de facto standard component of sra-tools (a suite of CLI tools provided by the NCBI). While it's a parallel version of the original fastq-dump with significant performance gains, it still has major limitations.

fasterq-dump does what it's supposed to do - it dumps FASTQ records from an SRA file - but it has two significant drawbacks that make it less than ideal.

  1. It creates a bunch of temporary files to output sorted records.
  2. It doesn't compress any of the output it writes (including temporary files). In its documentation it warns that you should expect 17 times the size of the SRA accession during conversion - a massive storage burden.
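To make that storage burden concrete, here is a minimal back-of-the-envelope sketch in Rust. The 17x multiplier is the rule of thumb quoted above; the 10 GiB accession size and the function name are assumptions chosen purely for illustration.

```rust
/// Rule-of-thumb multiplier quoted above from the fasterq-dump documentation:
/// disk needed during conversion is roughly 17x the accession size.
const PEAK_MULTIPLIER: u64 = 17;

/// Estimate the disk space (in bytes) needed while converting an accession
/// of `sra_bytes` bytes. Hypothetical helper, not part of any tool.
fn estimated_peak_bytes(sra_bytes: u64) -> u64 {
    sra_bytes.saturating_mul(PEAK_MULTIPLIER)
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // Hypothetical accession size, chosen only for illustration.
    let sra_bytes = 10 * GIB;
    let peak = estimated_peak_bytes(sra_bytes);
    println!(
        "a {} GiB accession needs roughly {} GiB of disk during conversion",
        sra_bytes / GIB,
        peak / GIB
    );
}
```

Under these assumptions a 10 GiB accession calls for about 170 GiB of free disk, which adds up quickly when processing many accessions in the cloud.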

At Arc we identified fasterq-dump as a major bottleneck in our scRecounter pipeline - a component of the scBaseCamp project. The uncompressed output was inflating cloud storage costs and adding unnecessary processing steps. We needed better programmatic access to SRA files to extract only the information we needed (i.e. sequences) and compress it efficiently without additional steps.

This need aligned perfectly with my work on the BINSEQ file format family, as programmatic access to SRA files would allow direct writing to BINSEQ files, bypassing FASTQ altogether.

xsra - a simplified fasterq-dump§

These challenges motivated the development of xsra - a faster, more disk-friendly alternative to fasterq-dump written in Rust.

Features§

  1. Parallel extraction of sequencing records
  2. Optional compression for FASTQ and FASTA (gzip, bgzip, zstd)
  3. SRA → BINSEQ (*.bq and *.vbq) conversion
  4. Read segment level extraction (i.e. only reads 1, 2 -- or only read 3 -- etc.)
  5. Network streaming of SRA files
  6. Prefetching SRA files to disk

Key differences from fasterq-dump§

xsra differs from fasterq-dump in several important ways:

  1. Zero temporary files - dramatically reducing disk usage
  2. Records are not guaranteed to be in the same order as the original SRA file (unless run with a single thread), though read segments are always correctly paired.
  3. Original record headers are not preserved (sequences and quality scores remain untouched)
  4. FASTQ output does not repeat the header on the + line (3rd line)

If these differences are deal-breakers for your workflow, fasterq-dump remains available. However, for many applications where sequence order doesn't matter and headers are irrelevant, xsra offers substantial efficiency gains in both processing time and storage.
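The ordering and header differences are easier to see in code. The following is a minimal sketch of the general technique, not xsra's actual implementation: each worker formats a chunk of records into a private buffer and appends it to the shared output in one step. The `Spot` type, the function names, the synthetic `@spot.segment` header, and the interleaved output layout are all assumptions made for illustration.

```rust
use std::io::Write;
use std::sync::Mutex;
use std::thread;

/// One spot and its read segments (e.g. R1 and R2). Hypothetical type.
struct Spot {
    index: usize,
    segments: Vec<(String, String)>, // (sequence, quality)
}

/// Format one segment as a 4-line FASTQ record with a bare `+` line.
/// The header is synthetic (an assumption); sequence and quality are untouched.
fn write_fastq(buf: &mut Vec<u8>, spot: usize, seg: usize, seq: &str, qual: &str) {
    writeln!(buf, "@{}.{}\n{}\n+\n{}", spot, seg + 1, seq, qual).unwrap();
}

/// Format chunks of spots on separate threads and append each finished chunk
/// to a shared output buffer. No temporary files are involved.
fn dump_parallel(spots: &[Spot], n_threads: usize) -> Vec<u8> {
    let out = Mutex::new(Vec::new());
    let n = n_threads.max(1);
    let chunk_size = ((spots.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        for chunk in spots.chunks(chunk_size) {
            let out = &out;
            s.spawn(move || {
                // Each worker fills a private buffer first...
                let mut local = Vec::new();
                for spot in chunk {
                    for (i, (seq, qual)) in spot.segments.iter().enumerate() {
                        write_fastq(&mut local, spot.index, i, seq, qual);
                    }
                }
                // ...then appends it in one step, so segments of a spot stay
                // adjacent even though chunk order depends on thread timing.
                out.lock().unwrap().extend_from_slice(&local);
            });
        }
    });
    out.into_inner().unwrap()
}

fn main() {
    let spots: Vec<Spot> = (0..4)
        .map(|i| Spot {
            index: i,
            segments: vec![
                ("ACGT".to_string(), "IIII".to_string()),
                ("TTGA".to_string(), "FFFF".to_string()),
            ],
        })
        .collect();
    let fastq = dump_parallel(&spots, 2);
    print!("{}", String::from_utf8(fastq).unwrap());
}
```

With one thread the output follows the input order; with several, whichever chunk finishes first is written first. In both cases the segments of a spot stay together and the third line of every record is a lone `+`.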

Separation of CLI and Library§

xsra functions as a CLI utility, but the development process also yielded ncbi-vdb-sys - a Rust crate (library) that interoperates with the original ncbi-vdb C library through FFI bindings.

ncbi-vdb-sys provides safe wrappers around the C library and enables programmatic interaction with SRA files from Rust. It supports random record access and random segment access on each record. For more information, see its documentation.

Benchmarks§

The following benchmarks use accession SRR27592687 as an example. For benchmarking details, see the benchmarking repo.

Runtime§

xsra-runtime

For uncompressed FASTQ, xsra is at least 5x faster than fasterq-dump, about 2x faster for compressed output, and almost 10x faster writing *.bq files compared to FASTQ.

Note: fastq-dump is marked with asterisks because it is not multi-threaded.

Disk Usage§

xsra-disk

In terms of disk usage, xsra dramatically reduces space requirements by eliminating temporary files and record headers, with further reductions achieved through built-in compression.

Conclusion§

This project has been on my mind for several years (since opening that original GitHub issue), and I'm pleased to have finally had time to develop it. I hope xsra proves helpful to the bioinformatics community.

You can get started easily with the Rust package manager cargo:

cargo install xsra