
Introducing paraseq: a minimal-copy FASTQ parser§

Introduction§

FASTQ files are the standard format for storing DNA sequencing data, containing millions or billions of DNA sequences along with their quality scores. Parsing these files efficiently is a pivotal first step for many bioinformatics analyses, from genome assembly to single-cell sequencing. While there are many libraries available for FASTQ parsing, the most performant ones share a fairly consistent underlying algorithm.

A naive FASTQ parser§

A naive FASTQ parser will iterate over the lines of some input file, group them into sets of four, and then return a Record that looks something like this:

pub struct Record {
    id: Vec<u8>,    // The sequence identifier (e.g., "@SRR1234.1")
    seq: Vec<u8>,   // The DNA sequence (e.g., "ATCG...")
    sep: Vec<u8>,   // The separator line (usually just "+")
    qual: Vec<u8>,  // Quality scores for each base in the sequence
}
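
As a sketch, the parse loop for such a record might look something like this (a minimal illustration of the naive approach, not any particular library's implementation; it ignores truncated trailing records for brevity):

use std::io::{self, BufRead};

fn next_record<R: BufRead>(reader: &mut R) -> io::Result<Option<Record>> {
    let mut fields: Vec<Vec<u8>> = Vec::with_capacity(4);
    for _ in 0..4 {
        // Each field requires a fresh allocation -- this is the cost
        // that dominates when parsing billions of records
        let mut line = Vec::new();
        if reader.read_until(b'\n', &mut line)? == 0 {
            return Ok(None); // end of stream
        }
        if line.last() == Some(&b'\n') {
            line.pop(); // drop the trailing newline
        }
        fields.push(line);
    }
    let mut it = fields.into_iter();
    Ok(Some(Record {
        id: it.next().unwrap(),
        seq: it.next().unwrap(),
        sep: it.next().unwrap(),
        qual: it.next().unwrap(),
    }))
}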

This works fine when you are parsing a small number of records and you need ownership of the internal data for each record. In that case, the cost of allocating a Vec for each record field is amortized by the overall runtime of your program.

Note here that I am using Vec<u8> instead of String because I will not be enforcing valid UTF-8 as a requirement of the parser. If you are unused to this representation, you can think of a Vec<u8> (a vector of bytes) as a String and a &[u8] (a slice of bytes) as a &str.
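
As a quick illustration of that parallel:

let id: Vec<u8> = b"@SRR1234.1".to_vec(); // analogous to a String
let seq: &[u8] = b"ATCG";                 // analogous to a &str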

However, in most bioinformatics cases you are processing billions of records, and these small per-record allocations build up into very large runtime costs that drastically affect performance. This per-record allocation strategy is the one taken by the bio crate.

Zero-Copy Parsing§

Performant libraries skirt this allocation cost with a zero-copy approach to parsing records. They create some Reader which solely owns the incoming stream of data. Whenever a Record is requested, the Reader does not clone the underlying data, but rather provides a reference into its internal buffer.

pub struct Reader {
    buffer: Vec<u8>, // owns the raw bytes read from the input stream
}

pub struct RefRecord<'a> {
    data: &'a [u8], // borrowed view into the Reader's buffer
    start: usize,   // where this record begins in `data`
    end: usize,     // where this record ends in `data`
}

This avoids the allocation cost of creating a new Vec for every record while still letting users interact with slices of the underlying data (&[u8]), which is likely the expected argument type for downstream functions that work with sequencing data.
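
As a sketch, an accessor on the RefRecord above might look like this (real libraries also track per-line offsets so they can return the identifier, sequence, and quality lines individually):

impl<'a> RefRecord<'a> {
    // Returns the raw bytes of this record, borrowed directly from the
    // Reader's buffer -- no allocation, no copy
    pub fn bytes(&self) -> &'a [u8] {
        &self.data[self.start..self.end]
    }
}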

This is an excellent parsing model, and it is the underlying mechanism for popular state-of-the-art sequence parsing libraries like seq_io, needletail, and fastq.

Challenges with Parallel Processing§

While the zero-copy approach works well for sequential processing, parallelizing this strategy introduces new challenges. Simply sharing a single buffer across threads would create contention and synchronization overhead. This leads us to explore how modern FASTQ parsers handle parallel processing while trying to maintain the performance benefits of zero-copy parsing. needletail doesn't have a built-in implementation for parallel processing, but seq_io and fastq-rs share a common strategy in their parallel implementations.

Use a RecordSet!§

seq_io and fastq-rs each use a RecordSet, created per thread, which owns its own buffer. The RecordSet effectively acts as a thread-local Reader, handing out RefRecords to the processing code on that thread.

// RecordSet maintains both the raw data and the locations of each record within that data
pub struct RecordSet {
    buffer: Vec<u8>,              // Contains the raw bytes of multiple FASTQ records
    positions: Vec<Positions>,    // Tracks where each record starts and ends in the buffer
}

// Positions stores the byte offsets for each component of a FASTQ record
pub struct Positions {
    start: usize,      // Start of the entire record
    end: usize,        // End of the entire record
    seq_start: usize,  // Where the sequence line begins
    sep_start: usize,  // Where the separator line begins
    qual_start: usize, // Where the quality scores begin
}
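
As a sketch, the RecordSet can then hand out RefRecords by pairing its buffer with the stored positions (the accessor name here is hypothetical):

impl RecordSet {
    pub fn get(&self, i: usize) -> RefRecord<'_> {
        let p = &self.positions[i];
        // Still zero-copy at this level: the RefRecord just borrows a
        // view into the RecordSet's own buffer
        RefRecord {
            data: self.buffer.as_slice(),
            start: p.start,
            end: p.end,
        }
    }
}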

The Reader still performs the core parsing algorithm: it identifies the position of each record in its buffer, then copies those bytes and positions into the RecordSet. This is a single copy step for the entire RecordSet, which is very fast (and usually doesn't reallocate memory). It is typically amortized by the cost of whatever processing is done in parallel anyway, so it matters little in the long run.
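
A sketch of that handoff (method and helper names are hypothetical, and position bookkeeping is omitted for brevity):

// Byte offset just past the last complete record, treating a record
// as four newline-terminated lines
fn last_complete_record_end(buf: &[u8]) -> usize {
    let (mut end, mut newlines) = (0, 0);
    for (i, &b) in buf.iter().enumerate() {
        if b == b'\n' {
            newlines += 1;
            if newlines % 4 == 0 {
                end = i + 1;
            }
        }
    }
    end
}

impl Reader {
    pub fn fill_record_set(&mut self, set: &mut RecordSet) {
        let end = last_complete_record_end(&self.buffer);
        set.buffer.clear();
        set.buffer.extend_from_slice(&self.buffer[..end]); // the single copy
        self.buffer.drain(..end); // retain only the incomplete tail
    }
}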

Why do we need that single-copy?§

You may ask, though: do we actually need that copy step?

It seems like the Reader is acting a bit like a middleman in this context, and maybe the additional copy step could be skipped if we could fill our thread-local RecordSets directly from the incoming byte stream.

The problem comes down to the fact that, because we're using fixed-size buffers in the RecordSet and Reader, there is no guarantee that a buffer contains only complete records.

For example, if we think of the underlying bytes we may actually have something like this in our buffer after a fill operation:

// Buffer visualization legend:
// i = identifier line
// s = sequence line
// + = separator line
// q = quality score line
// . = newline character

// An example byte buffer
| i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i
^complete  ^complete  ^complete  ^complete  ^complete  ^complete  ^incomplete

So the Reader's job when distributing to RecordSets is to ensure that only complete records are passed along, keeping records from being mangled.

// The byte buffer passed to the record set is only complete records
| i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. |

// The remaining byte buffer is initially just the incomplete record
| i

// The next fill operation picks up where the stream
// left off with an unmangled record
| i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. |

If the Reader were skipped and each RecordSet pulled bytes directly from the input stream into a fixed-length buffer, we would be almost guaranteed to end up with mangled records.

// An example fill from RecordSet_1
| i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i

// The next fill from RecordSet_2
| s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s

// The next fill from RecordSet_3
| +.q. | i.s.+.q. | i.s.+.q. | i.s.+.q. | i.s.+

Using the Reader as an intermediary to identify full records skirts this problem and guarantees that each RecordSet receives a buffer of exclusively complete records, but it comes at the cost of that extra copy step.

So in the parallel context these zero-copy parsers are actually one-copy parsers.

Introducing a minimal-copy approach§

Here I'll present an alternative approach, which I'm calling a minimal-copy parser. It still uses a middleman Reader like the one-copy parsers, but it largely minimizes the number of bytes actually copied between the Reader and RecordSets. This is done by changing the architecture of both structs to the following:

pub struct Reader<R: std::io::Read> {
    handle: R,         // the underlying input stream
    overflow: Vec<u8>, // holds incomplete record bytes between fills
}

pub struct RecordSet {
    buffer: Vec<u8>,   // holds complete records for one thread
}

Then, each RecordSet fill operation follows these steps in order:

  1. Read all bytes from the Reader overflow buffer
  2. Fill internal buffer from the Reader input stream (reader.handle)
  3. Identify all complete record boundaries
  4. Push back incomplete record bytes to overflow

This approach still has the Reader act as a middleman that collects incomplete records, but it exposes a tunable optimization that can approach zero-copy performance: you can tune the RecordSet fill operation to minimally capture incomplete records and, as a result, minimally copy bytes between the Reader and RecordSets.
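
A sketch of one fill operation following the four steps above (names, signatures, and the chunk size are illustrative, not paraseq's actual API; this reuses last_complete_record_end from the earlier sketch):

const CHUNK_SIZE: usize = 64 * 1024; // illustrative fill size

impl RecordSet {
    pub fn fill<R: std::io::Read>(
        &mut self,
        reader: &mut Reader<R>,
    ) -> std::io::Result<()> {
        // 1. Read all bytes the previous fill pushed back to overflow
        self.buffer.clear();
        self.buffer.append(&mut reader.overflow);

        // 2. Fill the rest of the buffer from the input stream
        let start = self.buffer.len();
        self.buffer.resize(start + CHUNK_SIZE, 0);
        let n = reader.handle.read(&mut self.buffer[start..])?;
        self.buffer.truncate(start + n);

        // 3. Identify the boundary of the last complete record
        let end = last_complete_record_end(&self.buffer);

        // 4. Push the incomplete tail back to the Reader's overflow
        reader.overflow.extend_from_slice(&self.buffer[end..]);
        self.buffer.truncate(end);
        Ok(())
    }
}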

// Example filled buffer in a RecordSet
| i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i
^complete ^complete ^complete ^complete ^complete ^complete ^incomplete

loop {
	// Push incomplete record back to Reader
	| i
	^ incomplete pushed back to reader

	// Next Record Set first fills from Reader
	| i
	^ incomplete pulled from reader

	// Then fills from input stream
	| i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i.s.+.q | i
	    ^ new bytes pulled from input stream                    ^ incomplete
}

Introducing paraseq§

This is the approach I've taken with paraseq, a minimal-copy parallel FASTQ/FASTA parsing library. It was designed from the ground up for minimal-copy RecordSet parsing, but it additionally enforces a maximum capacity of records per set. This means each RecordSet will contain either its maximum capacity of records, or fewer if the end of the file has been reached. This lets you trivially process paired-end records in parallel, because the two RecordSets will never fall out of sync with one another.

Unfortunately, the fixed-buffer-per-RecordSet approach taken by seq_io and fastq-rs can never guarantee that two files are processed in sync when operating at the RecordSet level. If, for example, the first file had 50bp sequences and the second had 100bp sequences, you would not be able to fit as many complete records in the second RecordSet as in the first (because each complete record takes twice as many bytes).

paraseq addresses this by dynamically adjusting the RecordSet byte buffer to accommodate its expected capacity of records. If, say, 1024 records are requested per filled RecordSet, but one file contains 50bp sequences and the other 100bp sequences, then at runtime the second RecordSet expands its internal buffer to be twice as large as the first.

It then uses runtime statistics to optimize its minimal-copy operations: it tracks the average record size within RecordSets and estimates the number of bytes needed to reach its maximum number of complete records. So if the buffer has to grow dramatically to accommodate one unusually large sequence, it will not keep overfilling its internal buffer and copying back large slices of data for the remainder of the run.
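
An illustrative sketch of that bookkeeping (not paraseq's actual internals): track the mean record size seen so far, then size the next fill to hold the target record capacity plus a little headroom.

#[derive(Default)]
struct FillStats {
    records_seen: u64,
    bytes_seen: u64,
}

impl FillStats {
    fn observe(&mut self, records: u64, bytes: u64) {
        self.records_seen += records;
        self.bytes_seen += bytes;
    }

    // Bytes to request for the next fill of `capacity` records
    fn target_fill_size(&self, capacity: usize) -> usize {
        if self.records_seen == 0 {
            return 64 * 1024; // default before any records are seen
        }
        let avg = (self.bytes_seen / self.records_seen) as usize;
        // ~10% headroom so a slightly long tail record doesn't force a
        // large copy-back of overflow bytes on every fill
        capacity * avg + capacity * avg / 10
    }
}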

paraseq was built mostly as a proof of concept and as a tool for managing paired parallel sequences, but as a standalone sequential processor it actually performs as well as its zero-copy alternatives.

Benchmarking paraseq against other FASTQ parsers§

This is a benchmark I've created that parses 10M records, with each input file containing a different sequence size per record. It is a minimal parsing test that reads each record and counts the number of records in the file. For the source code of the benchmark, see the paraseq_benchmark repository.

This benchmark was run on my local computer using an AMD Ryzen 7 7700X and an NVMe SSD.

[Figure: Benchmark throughput]

We can see that paraseq matches the performance of the zero-copy parsers (needletail, seq_io, and fastq) while improving on the one-copy parallel implementations used by seq_io and fastq.

Conclusion§

Modern FASTQ parsers represent different approaches to balancing memory allocation and copy operations. The traditional zero-copy approach works well for sequential processing but requires an extra copy step when parallelized. The minimal-copy approach presented here reduces these copy operations by dynamically managing buffer sizes and using runtime statistics to optimize record set boundaries.

The benchmarks demonstrate that this alternative architecture achieves performance comparable to existing zero-copy implementations while addressing some limitations in parallel paired-end processing. This suggests that minimizing copy operations is a viable strategy for optimizing buffer management in parallel processing without sacrificing performance.

A key insight from this work is that tracking and adjusting to runtime characteristics (like average record size) can help optimize buffer management. This becomes particularly relevant when processing paired files with different sequence lengths, where fixed-size buffers may be sub-optimal. Currently, paraseq uses a fairly naive approach to guessing expected record sizes, but this could be optimized as well, eventually bringing copy-operation costs down even further.

I hope that this optimization story was interesting to somebody out there in the void, and if you liked it, have questions, or ideas for improving paraseq, please feel free to reach out.