BINSEQ: Addressing fundamental problems in sequence analysis

Recently we released a preprint describing BINSEQ - a new file format for sequencing data that we developed in the course of building new sequence processing tools at Arc.

Is a format like BINSEQ doomed from the start?

The release has gotten a lot of excitement, but I also think that a community transition from FASTQ -> BINSEQ (or any other format) is incredibly challenging, because it's a bit of a collective action problem and a catch-22.

Everything is currently written around FASTQ, and it would be a nontrivial task to go back and rewrite core tools to accept new formats - especially a format without widespread community adoption. It feels not worth it to rewrite tools to accept a niche format, but it is likely to remain a niche format because no tools accept it.

Here's the thing.

I don't think I can convince you to adopt a new format.

I do think that I can show examples of how BINSEQ solves fundamental problems of FASTQ and how solving those problems leads to substantial performance gains.

Then - it's up to you.

The example: a sequence grep

If you work in bioinformatics - here's something you might do all the time.

gunzip -c <fastq.gz> | grep <some_sequence>

Now - how long would this take to run?

For a random 10M record fastq.gz with 100bp per record I'm clocking it at about 5.6s.

That's a long time - especially when you consider that the dataset you're actually going to be running it on will likely be in the billions or tens of billions of records.

That 5 seconds you're seeing now on this small dataset - well that might force a lunch-break on the full dataset.
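To put rough numbers on that (a back-of-the-envelope extrapolation, assuming runtime scales linearly with record count):

```python
# Extrapolate the measured 5.6 s on 10M records to full-scale datasets,
# assuming throughput stays constant per record.
measured_seconds = 5.6
measured_records = 10_000_000

def projected_time(records: int) -> float:
    """Projected wall-clock seconds at the measured per-record rate."""
    return measured_seconds * records / measured_records

print(f"1 billion records:  {projected_time(1_000_000_000) / 60:.0f} min")
print(f"10 billion records: {projected_time(10_000_000_000) / 3600:.1f} h")
```

And that's for a single pass - a real exploratory session involves many such scans.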

We can solve this - we have the technology

Plenty of tools have been developed over the years that do this type of operation better.

To name a few - there's seqkit, fqgrep, grepq, and plenty of other codebases in various languages that tackle this problem.

For the purposes of this demo I've also built my own, seqpls, which makes use of my FASTQ parsing library paraseq.

Notably, I built seqpls to have the exact same matching implementation as the grep subcommand in bqtools - the CLI utility I built for working with BINSEQ files.

This lets us directly see the difference between running the same matching algorithm with different access patterns - i.e. FASTQ or BINSEQ.

Profiling sequence greppers - BINSEQ shows 20x to 80x improvement over FASTQ-based tools

For the purposes of this demo I'm going to profile the naive implementation above, seqkit, fqgrep, grepq, seqpls, and bqtools.

Notably - many of these tools allow for multithreaded processing, so I am going to run them with 16 threads and use hyperfine to benchmark. We'll use my random sequence generator, nucgen, to draw some random sequences.

# Generate random sequences, make a binseq file, prepare regex for grepq
nucgen -n 10000000 -l 100 some.fq.gz \
	&& bqtools encode -T3 some.fq.gz -o some.bq \
	&& echo "ACGTACGT" > patterns.txt

hyperfine \
	-n "naive" \
		"gunzip -c some.fq.gz | grep 'ACGTACGT' -c" \
	-n "seqkit" \
		"seqkit grep -s -C -m0 -P -j16 -p 'ACGTACGT' some.fq.gz" \
	-n "fqgrep" \
		"fqgrep -t16 -F -c -e 'ACGTACGT' some.fq.gz" \
	-n "grepq" \
		"grepq --read-gzip --count patterns.txt some.fq.gz" \
	-n "seqpls" \
		"seqpls -T16 -c -e 'ACGTACGT' some.fq.gz" \
	-n "bqtools" \
		"bqtools grep -T16 -C -e 'ACGTACGT' some.bq";
Benchmark 1: naive
  Time (mean ± σ):      5.695 s ±  0.009 s    [User: 5.215 s, System: 0.885 s]
  Range (min … max):    5.678 s …  5.709 s    10 runs

Benchmark 2: seqkit
  Time (mean ± σ):      3.144 s ±  0.018 s    [User: 4.536 s, System: 0.079 s]
  Range (min … max):    3.127 s …  3.181 s    10 runs

Benchmark 3: fqgrep
  Time (mean ± σ):      1.379 s ±  0.009 s    [User: 2.202 s, System: 0.136 s]
  Range (min … max):    1.363 s …  1.399 s    10 runs

Benchmark 4: grepq
  Time (mean ± σ):      1.442 s ±  0.009 s    [User: 1.891 s, System: 0.132 s]
  Range (min … max):    1.429 s …  1.457 s    10 runs

Benchmark 5: seqpls
  Time (mean ± σ):      1.295 s ±  0.007 s    [User: 1.632 s, System: 0.093 s]
  Range (min … max):    1.286 s …  1.310 s    10 runs

Benchmark 6: bqtools
  Time (mean ± σ):      70.2 ms ±   5.5 ms    [User: 646.1 ms, System: 42.6 ms]
  Range (min … max):    62.3 ms …  79.0 ms    46 runs

Summary
  bqtools ran
   18.44 ± 1.46 times faster than seqpls
   19.65 ± 1.56 times faster than fqgrep
   20.54 ± 1.63 times faster than grepq
   44.79 ± 3.55 times faster than seqkit
   81.13 ± 6.41 times faster than naive

bqtools ran a whopping 81x faster than the naive CLI implementation - and ~20x faster than the new blazingly fast Rust implementations (seqpls included).

Sure - seqpls is using some nice optimizations for performance and even beating out fqgrep and grepq (at least in this single-expression case) - but 18x faster with the same implementation? That's just BINSEQ dunking on FASTQ.

So what's the diff - is BINSEQ just another blazingly fast parser written in Rust with a spiffy new colored --help menu?

Well it does have a nicely colored CLI - and I'm not convinced the colors don't make it go faster -- but the real gains come from actually using our hardware well.

Parallel in name only - let's talk about producer-consumer models

Remember when I said that many of these tools allow multithreaded processing and I gave them 16 threads?

Well they're just not using the threads well.

It's not an implementation error, or lazy devs not micro-benchmarking to the right mutex (check out parking_lot for the fastest mutex in town) - the issue is more systemic.

Because FASTQ as a format doesn't enforce a fixed record size, and because the files are too large to ever keep around uncompressed, parsing it involves a lot of ambiguity - specifically around:

  1. How many records there are in the file
  2. Where every record is in the file

As a result, it needs to be parsed as it's streamed and decompressed - an inherently single-threaded operation.
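To make that concrete, here's a toy FASTQ parser (a Python sketch, not what any of these tools actually use). Note that finding record i requires having already parsed records 0 through i-1:

```python
import io

def parse_fastq(stream):
    """Minimal FASTQ parser. Each record is 4 lines, but line lengths vary,
    so a record's byte offset is unknown until every prior record has been
    read - the scan is inherently sequential."""
    while True:
        header = stream.readline()
        if not header:
            return
        name = header[1:].rstrip(b"\n")   # strip '@' and trailing newline
        seq = stream.readline().rstrip(b"\n")
        stream.readline()                 # '+' separator line
        qual = stream.readline().rstrip(b"\n")
        yield name, seq, qual

fastq = b"@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n"
records = list(parse_fastq(io.BytesIO(fastq)))
# There's no way to jump straight to r2: its offset depends on r1's length.
```

And in practice the stream above is also the output of a decompressor, which is itself sequential.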

This forces a parallel architecture called the single-producer-multi-consumer (SPMC) model, where one thread handles the IO operations (decompression + parsing) and feeds data to the consumer threads, which actually perform the interesting per-sequence task (in this case, grep).

                   | Consumer |
| Producer |  =>   | Consumer |
                   | Consumer |

This is the approach taken by fqgrep and seqpls - but it's also the most prevalent parallel architecture for processing sequencing data across the bioinformatics ecosystem; it's not limited to these sequence greppers.
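As an illustrative sketch (in Python, and deliberately simplified - not how fqgrep or seqpls are implemented internally), an SPMC grep looks something like this:

```python
import queue
import threading

def spmc_grep(records, pattern, n_workers=4):
    """Single-producer / multi-consumer sketch: one thread feeds a shared
    queue, workers pull from it. When matching is cheap, workers spend
    most of their time waiting on the queue."""
    work = queue.Queue(maxsize=1024)
    hits = 0
    lock = threading.Lock()

    def consumer():
        nonlocal hits
        while True:
            seq = work.get()
            if seq is None:          # sentinel: producer is done
                break
            if pattern in seq:       # the cheap per-record task
                with lock:
                    hits += 1

    workers = [threading.Thread(target=consumer) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for seq in records:              # the single producer: parse + enqueue
        work.put(seq)
    for _ in workers:                # one sentinel per worker
        work.put(None)
    for w in workers:
        w.join()
    return hits
```

No matter how many consumers you add, total throughput is capped by how fast that one producer loop can decompress, parse, and enqueue.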

What the flux is an unbalanced SPMC?

A common problem of SPMC architectures is task imbalance - where performance saturates as the consumer task becomes faster than the producer task.

These sequence-greppers are (coincidentally...) a perfect example of this behavior.

Their producer-consumer model is highly unbalanced. Pattern matching or regular expression evaluation on sequences is a super cheap operation, so the consumer is significantly faster than the producer.

So as you provide more and more threads to your tool, most of the time those threads are sitting around waiting for their turn to pull data from the producer and you get performance saturation.

They effectively take turns processing the data instead of processing the data in parallel.

With very cheap per-record operations you won't really see improvement past 2 threads. You get an improvement going from 1 -> 2 (separating the IO from the matching), but after that your consumer threads basically act in a round robin and can't truly parallelize the task.

fqgrep demonstrates this behavior exactly:

hyperfine \
	-n "fqgrep-1-thread" \
		"fqgrep --threads 1 -c -e 'ACGTACGT' some.fq.gz" \
	-n "fqgrep-2-thread" \
		"fqgrep --threads 2 -c -e 'ACGTACGT' some.fq.gz" \
	-n "fqgrep-4-thread" \
		"fqgrep --threads 4 -c -e 'ACGTACGT' some.fq.gz" \
	-n "fqgrep-8-thread" \
		"fqgrep --threads 8 -c -e 'ACGTACGT' some.fq.gz";
Benchmark 1: fqgrep-1-thread
  Time (mean ± σ):      2.031 s ±  0.015 s    [User: 1.937 s, System: 0.151 s]
  Range (min … max):    2.006 s …  2.049 s    10 runs

Benchmark 2: fqgrep-2-thread
  Time (mean ± σ):      1.384 s ±  0.006 s    [User: 1.959 s, System: 0.149 s]
  Range (min … max):    1.374 s …  1.395 s    10 runs

Benchmark 3: fqgrep-4-thread
  Time (mean ± σ):      1.384 s ±  0.018 s    [User: 2.025 s, System: 0.133 s]
  Range (min … max):    1.373 s …  1.433 s    10 runs

Benchmark 4: fqgrep-8-thread
  Time (mean ± σ):      1.375 s ±  0.009 s    [User: 2.041 s, System: 0.130 s]
  Range (min … max):    1.359 s …  1.387 s    10 runs

Summary
  fqgrep-8-thread ran
    1.01 ± 0.01 times faster than fqgrep-2-thread
    1.01 ± 0.01 times faster than fqgrep-4-thread
    1.48 ± 0.01 times faster than fqgrep-1-thread

There's some slight variation in performance going from 2 -> 4 -> 8, but it's small potatoes compared to the gains from 1 -> 2. We've effectively saturated - so we won't see any performance boost by adding more threads.

[!note] Just to be clear - I'm not trying to pick on fqgrep. It's a great tool made by excellent bioinformaticians. The problem I'm describing is pervasive and not unique to a single tool.

Getting rid of the single-producer model with BINSEQ

This is where BINSEQ really shines. Because each record is a fixed size, and because the 2-bit encoding inherently shrinks the file so it doesn't require additional compression, it addresses the major limitations of FASTQ:

  1. We know a priori how many records there are in the file
  2. We have random, lock-free access to each record

So parallel processing becomes trivial - each thread knows exactly which records it needs to process and can access those records directly in a lock-free manner - effectively replacing the single-producer-multi-consumer (SPMC) model with a multi-producer-multi-consumer (MPMC) model.

| Producer |   =>   | Consumer |
| Producer |   =>   | Consumer |
| Producer |   =>   | Consumer |

Technically each consumer thread is its own producer, so we don't see the round-robin behavior where each thread waits around for the single producer to get to it. Instead, each thread just pulls the records it needs when it needs them, and we get true parallelization.
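Here's the idea as a Python sketch. It uses raw fixed-size byte records to keep things short - real BINSEQ records are 2-bit packed behind a file header, so this only shows the addressing arithmetic that fixed-size records make possible:

```python
import threading

RECORD_SIZE = 8  # bytes per record (in BINSEQ, fixed by the file header)

def mpmc_grep(buf: bytes, pattern: bytes, n_threads: int = 4) -> int:
    """Fixed-size records mean record i lives at byte i * RECORD_SIZE,
    so every thread can seek straight to its own share of the file -
    no shared queue, no single producer."""
    n_records = len(buf) // RECORD_SIZE
    counts = [0] * n_threads  # one slot per thread: no lock on the hot path

    def worker(tid: int):
        # Each thread takes a strided share of the record indices.
        for i in range(tid, n_records, n_threads):
            start = i * RECORD_SIZE
            record = buf[start:start + RECORD_SIZE]
            if pattern in record:
                counts[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)

buf = b"ACGTACGT" + b"TTTTTTTT" + b"AAACGTAC"  # three 8-byte records
```

Because record i always lives at a computable offset, there's no queue, no lock on the data path, and no idle consumers waiting for their turn.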

We can clearly see its performance scale with the number of threads:

hyperfine \
	-n "bqtools-1-thread" \
		"bqtools grep --threads 1 -C -e 'ACGTACGT' some.bq" \
	-n "bqtools-2-thread" \
		"bqtools grep --threads 2 -C -e 'ACGTACGT' some.bq" \
	-n "bqtools-4-thread" \
		"bqtools grep --threads 4 -C -e 'ACGTACGT' some.bq" \
	-n "bqtools-8-thread" \
		"bqtools grep --threads 8 -C -e 'ACGTACGT' some.bq" \
	-n "bqtools-16-thread" \
		"bqtools grep --threads 16 -C -e 'ACGTACGT' some.bq"
Benchmark 1: bqtools-1-thread
  Time (mean ± σ):     519.5 ms ±  80.8 ms    [User: 468.8 ms, System: 46.1 ms]
  Range (min … max):   478.7 ms … 701.7 ms    10 runs

Benchmark 2: bqtools-2-thread
  Time (mean ± σ):     249.9 ms ±   1.5 ms    [User: 447.5 ms, System: 32.2 ms]
  Range (min … max):   247.9 ms … 252.4 ms    11 runs

Benchmark 3: bqtools-4-thread
  Time (mean ± σ):     147.8 ms ±  23.2 ms    [User: 465.7 ms, System: 33.3 ms]
  Range (min … max):   133.7 ms … 190.7 ms    22 runs

Benchmark 4: bqtools-8-thread
  Time (mean ± σ):      85.8 ms ±  12.9 ms    [User: 472.9 ms, System: 35.8 ms]
  Range (min … max):    76.3 ms … 107.9 ms    38 runs

Benchmark 5: bqtools-16-thread
  Time (mean ± σ):      64.0 ms ±   5.5 ms    [User: 654.9 ms, System: 39.1 ms]
  Range (min … max):    59.0 ms …  75.2 ms    44 runs

Summary
  bqtools-16-thread ran
    1.34 ± 0.23 times faster than bqtools-8-thread
    2.31 ± 0.41 times faster than bqtools-4-thread
    3.91 ± 0.33 times faster than bqtools-2-thread
    8.12 ± 1.44 times faster than bqtools-1-thread

Consider these results for a moment - operations like pattern matching (incredibly cheap) can continue to scale up to 16 threads while FASTQ-based tools are stalling out at 2 threads.

It's not just a multithreaded gain either - even single-threaded, BINSEQ is roughly 4x faster than FASTQ (519.5 ms vs fqgrep's 2.031 s).

Yeah but what about the time it takes to convert from FASTQ => BINSEQ?

Conversion is a one-time cost, and it's actually itself an unbalanced SPMC model.

The 2-bit encoding operation is the only real cost, and it's not heavy enough to bottleneck the producer (see bitnuc for the SIMD implementation).
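The scalar idea behind the encoding looks like this (a Python sketch of generic 2-bit packing - the specific base-to-bits mapping here is an illustrative assumption, not taken from the BINSEQ spec or bitnuc):

```python
def encode_2bit(seq: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base.
    At 2 bits/base, a 100 bp read fits in 25 bytes instead of 100."""
    code = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # illustrative mapping
    packed = 0
    for i, base in enumerate(seq):
        packed |= code[base] << (2 * i)
    return packed

def decode_2bit(packed: int, length: int) -> str:
    """Recover the nucleotide string from its 2-bit packed form."""
    return "".join("ACGT"[(packed >> (2 * i)) & 0b11] for i in range(length))

assert decode_2bit(encode_2bit("ACGTACGT"), 8) == "ACGTACGT"
```

It's a fixed handful of bit operations per base with no entropy coding, which is why it's cheap enough to keep up with a decompressing producer.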

It actually takes less time than just unzipping the FASTQ:

hyperfine \
	-n "gunzip-wc" \
		"gunzip -c some.fq.gz | wc -l" \
	-n "gunzip-null" \
		"gunzip -c some.fq.gz > /dev/null" \
	-n "convert" \
		"bqtools encode -T3 some.fq.gz -o some.bq";
Benchmark 1: gunzip-wc
  Time (mean ± σ):      5.687 s ±  0.027 s    [User: 5.273 s, System: 0.902 s]
  Range (min … max):    5.665 s …  5.757 s    10 runs

Benchmark 2: gunzip-null
  Time (mean ± σ):      5.294 s ±  0.009 s    [User: 5.235 s, System: 0.040 s]
  Range (min … max):    5.281 s …  5.306 s    10 runs

Benchmark 3: convert
  Time (mean ± σ):      1.475 s ±  0.014 s    [User: 1.998 s, System: 0.542 s]
  Range (min … max):    1.460 s …  1.509 s    10 runs

Summary
  convert ran
    3.59 ± 0.04 times faster than gunzip-null
    3.86 ± 0.04 times faster than gunzip-wc

Now for something completely different - paired-end grep

BINSEQ addresses the sequential access pattern of FASTQ for big gains - now let's talk about another problem of FASTQ that BINSEQ addresses: paired sequences.

Let's keep with our grep example - but let's try a less trivial task that maybe you've run into.

Let's say you're working with paired-end data: you have two FASTQs that are paired. The way we represent this is with two files (R1 and R2) that have the same number of sequences, and we keep paired sequences in sync by keeping the ordering of those files fixed.

So the 10th sequence in R1 corresponds to the 10th sequence in R2 - together those sequences act as a record, and oftentimes we process the record as a unit.

So let's say I want to inspect my sequences and pull all records that match one expression in R1 and a different expression in R2.

This is more challenging - we need a solution that keeps two compressed files in sync and evaluates a different operation on each file.

The greybeard solution - just use awk

Let's say you're an absolute one-liner legend - you might look at this problem and say:

"I can use awk for this."

Kudos. You're right. Here is a way to run this operation as a one-liner:

paste \
	<(gunzip -c r1.fq.gz | paste - - - -) \
	<(gunzip -c r2.fq.gz | paste - - - -) | \
	awk '$2 ~ /ACGTACGTACGT/ && $6 ~ /ACGTACGT/ {print}' | wc -l

This is by no means a simple one-liner to pull out of your hat (unless of course it is a wizard hat), but it does do what we want it to do.

We'll call this the naive solution (and put it in a file oneliner.sh) so we can benchmark against it.

Benchmarking paired greppers - BINSEQ shows 30x-100x improvement

Unfortunately, the suite of tools we can use to benchmark this type of operation is more limited. seqkit and grepq weren't built for this task, and for some reason fqgrep doesn't run fast enough in --paired mode for me to consider it a fair comparison (maybe a bug?).

For our FASTQ benchmark we'll use seqpls, which as I mentioned before follows the exact same matching algorithm as bqtools grep, so it'll be a head-to-head comparison between formats.

Again we'll use nucgen to generate 10M random sequences and use hyperfine to benchmark.

# Generate random sequences and build the binseq file
nucgen -n 10000000 -l 50 r1.fq.gz \
	&& nucgen -n 10000000 -l 150 r2.fq.gz \
	&& bqtools encode r1.fq.gz r2.fq.gz -o paired.bq -T3;

# Run benchmark
hyperfine \
	-n "naive" \
		"./oneliner.sh" \
	-n "seqpls" \
		"seqpls -T16 -c -e 'ACGTACGTACGT' -E 'ACGTACGT' r1.fq.gz r2.fq.gz" \
	-n "bqtools" \
		"bqtools grep -T16 -C -e 'ACGTACGTACGT' -E 'ACGTACGT' paired.bq";
Benchmark 1: naive
  Time (mean ± σ):      8.273 s ±  0.034 s    [User: 7.355 s, System: 3.641 s]
  Range (min … max):    8.250 s …  8.338 s    10 runs

Benchmark 2: seqpls
  Time (mean ± σ):      2.553 s ±  0.013 s    [User: 2.863 s, System: 0.144 s]
  Range (min … max):    2.536 s …  2.576 s    10 runs

Benchmark 3: bqtools
  Time (mean ± σ):      80.6 ms ±   5.8 ms    [User: 775.0 ms, System: 63.1 ms]
  Range (min … max):    75.1 ms …  98.7 ms    39 runs

Summary
  bqtools ran
   31.67 ± 2.30 times faster than seqpls
  102.63 ± 7.46 times faster than naive

We're seeing ~100x faster than the awk-wizard solution and ~30x faster than our efficient FASTQ based grepper with the same matching algorithm.

The bqtools advantage over seqpls also grows from ~18x in the single-end comparison to ~32x here - which clearly shows that record synchronization adds quite a bit of overhead.

Everything in its right place

So how does BINSEQ get around this problem? It keeps paired sequences right next to each other in the same file. Each BINSEQ record can hold either a single sequence or two sequences, and there's never any ambiguity about which sequence is which - nor can the file be misinterpreted as one with twice as many sequences (as can happen with an interleaved FASTQ).

This isn't just more convenient to work with (when you query a record, both sequences are returned together) - it's also much more computationally efficient, because we never need to manage two file handles or keep them in sync, which becomes decidedly non-trivial when moving a task to a parallel architecture.

In Conclusion

Let's recap what we've learned: BINSEQ isn't just marginally better than FASTQ - it's dramatically better in ways that matter for real bioinformatics work:

  • Speed: 20-100x faster runtime on real-world tasks
  • Scalability: Actually uses all those CPU cores you're paying for instead of maxing out at 2 threads
  • Paired-end handling: No more juggling multiple files and complex synchronization
  • Conversion: Faster than just unzipping your FASTQ files (a one-time cost for major ongoing gains)

At a practical level, this means your 30-minute grep operation becomes a 30-second one. Your overnight analysis pipeline might finish before lunch (or even in the time it takes you to grab a coffee). That multi-GB memory footprint shrinks dramatically because you're not storing everything in giant buffers waiting for the IO bottleneck.

I get it though - I can't magically make the entire bioinformatics ecosystem switch formats overnight. The catch-22 is real. Tool developers won't support a format nobody uses, and nobody uses a format tools don't support.

But for those of you building new tools or maintaining existing ones - BINSEQ offers a clean, simple API that makes integration straightforward. The performance gains are so substantial that early adopters get a serious competitive advantage. Your tools will be the fastest in their class, and users will notice.

For individual researchers or labs - even if you just use the bqtools CLI directly as a converter and processor, you're getting access to blazingly fast sequence operations that can dramatically speed up your exploratory data analysis. Convert once, benefit repeatedly.

For larger institutions and data providers - consider offering pre-converted BINSEQ files (or VBINSEQ for variable length + quality scores) alongside traditional formats. You'd be giving your users a choice that could save them thousands of compute hours.

I've laid out the evidence. I've shown you the benchmarks. I've explained the fundamental architectural advantages. The rest is up to you.

If you want to dive in, check out the repo - PRs and feature requests are always welcome. And if you integrate BINSEQ into your own tools, let us know! We're building a list of compatible software to help drive adoption.

FASTQ has served us well, but we can do better. I've been through the looking glass and to me there's no going back.