Releasing GIA v0.2

I've been updating gia on and off between experiments for the past few months, and I'm super excited to release a new version of it with oodles of updates!

My focus for this release was increasing the overall utility of gia and reaching feature parity with bedtools. Specifically, that meant more utilities, stranded methods, and broader file format support. I'll be updating the HTML-based documentation for these new tools soon, but in the meantime the CLI --help output should explain many of them.

Highlights

  1. Native support for all interval files (BED{3,4,5,6,12}, BED-like, GTFs)
  2. Specialized functions for HTSlib data (BAM / VCF)
  3. Expanded list of utilities (22 total functions!)
  4. Stranded methods for many of those utilities with specialized functions
  5. Arbitrary number of secondary inputs to relevant utilities
  6. Auto-determined file formats and mixed file formats
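
To give a feel for what auto-determining a format can involve, here is a minimal sketch that guesses a format from the column count of the first data line. The `GuessedFormat` enum and `guess_format` function are hypothetical names made up for this post. They are not gia's actual detection logic, which may work quite differently.

```rust
/// Hypothetical format labels for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum GuessedFormat {
    Bed3,
    Bed4,
    Bed5,
    Bed6,
    Bed12,
    Gtf,
    Unknown,
}

/// Guess a format from one tab-delimited data line (assumption: no header line).
fn guess_format(line: &str) -> GuessedFormat {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    match fields.len() {
        3 => GuessedFormat::Bed3,
        4 => GuessedFormat::Bed4,
        5 => GuessedFormat::Bed5,
        6 => GuessedFormat::Bed6,
        // GTF has 9 columns, with numeric start and end in columns 4 and 5.
        9 if fields[3].parse::<u64>().is_ok() && fields[4].parse::<u64>().is_ok() => {
            GuessedFormat::Gtf
        }
        12 => GuessedFormat::Bed12,
        _ => GuessedFormat::Unknown,
    }
}

fn main() {
    println!("{:?}", guess_format("chr1\t100\t200"));
    println!("{:?}", guess_format("chr1\t100\t200\tgene1\t0\t+"));
}
```

A real implementation would also need to deal with headers, comments, and BED-like files with extra columns, which this sketch ignores.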

Benchmarks

Benchmarking details can be found in a separate GitHub repo. Overall, across all the new utilities, gia is faster than both bedtools and bedops.

Global Speedup

Across all the newly implemented utilities, gia performs very well against the existing tools, bedtools and bedops. The speedup ranges from 2x to 16x depending on the utility and the number of intervals being processed.

Caveat: I am testing interval sets in the 2-5 million range. bedops does start to perform better at around the 100 million range, but that is a bit of an unfair direct comparison: because of the way intervals are merged, far fewer intervals are actually written, and writing accounts for a large portion of the runtime. I'll write a more detailed post on this in the future and include some benchmarking on that topic as well.

BAM Speedup

For BAM operations, gia is significantly faster than bedtools. I take advantage of the very powerful C library htslib through the Rust crate rust-htslib, which provides bindings to it. This also lets me take advantage of multi-threading, which is a huge boon for BAM operations, and the speedup is clear.

Caveat: bedtools makes no claims on optimized BAM operations, but a comparison is included for completeness and to showcase the generalization of interval operations for common tasks on multiple file types.

Stranded Operations

One of the most exciting parts of this release for me is the addition of stranded methods to a large number of utilities. This is a feature that is present in bedtools, but not in bedops, and is incredibly useful for my own work with splicing.

We can see that gia is significantly faster than bedtools in stranded operations, and the speedup is consistent across all utilities. Notably, though, there is a performance difference between stranded and non-stranded operations in both gia and bedtools. This makes sense, because strandedness adds an additional layer of complexity to the operation, but I think it can be optimized further, and it will likely be something I work on in a future release of bedrs.
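
To make that extra layer concrete, here is a minimal sketch of an unstranded versus a stranded overlap check. The `Interval` and `Strand` types and both functions are hypothetical and written just for this post. They are not the bedrs API, and the choice to treat an unknown strand as never matching is an assumption of the sketch.

```rust
/// Hypothetical strand type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strand {
    Forward,
    Reverse,
    Unknown,
}

/// Hypothetical half-open interval [start, end) on a numbered chromosome.
#[derive(Debug, Clone, Copy)]
struct Interval {
    chr: u32,
    start: u64,
    end: u64,
    strand: Strand,
}

/// Unstranded overlap: same chromosome and intersecting coordinates.
fn overlaps(a: &Interval, b: &Interval) -> bool {
    a.chr == b.chr && a.start < b.end && b.start < a.end
}

/// Stranded overlap: the same test plus a strand comparison.
/// Assumption: an unknown strand never matches anything.
fn stranded_overlaps(a: &Interval, b: &Interval) -> bool {
    a.strand != Strand::Unknown && a.strand == b.strand && overlaps(a, b)
}

fn main() {
    let a = Interval { chr: 1, start: 100, end: 200, strand: Strand::Forward };
    let b = Interval { chr: 1, start: 150, end: 250, strand: Strand::Reverse };
    println!("unstranded: {}", overlaps(&a, &b));
    println!("stranded:   {}", stranded_overlaps(&a, &b));
}
```

The extra comparison is cheap on its own, but every utility built on top of it also has to carry and check the strand, which is where the additional work comes from.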

What's Next

For those wondering what's next for gia, it'll be expanded support for streamed methods! For many utilities, gia still requires loading the entire file into memory before running the operation, which becomes a problem at extremely large file sizes. I'd like to have an optional streaming mode for all utilities, like bedops, which does a great job with its fully streamed design and avoids the allocation cost. Current streaming support in gia is limited to a few utilities, and their file format support is lacking as well. I explored some ideas for this a few months ago, but decided to back-burner them while I worked on the other features in this release.
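
As a rough idea of what a streamed utility looks like, here is a minimal sketch of a merge that reads sorted BED3 lines one at a time and only ever holds a single pending interval in memory. The names `Bed3`, `parse_bed3`, `StreamMerge`, and `merge_reader` are hypothetical and not taken from gia or bedrs. The sketch assumes the input is already sorted and treats bookended intervals as mergeable.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical BED3 record for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Bed3 {
    chr: String,
    start: u64,
    end: u64,
}

/// Parse one tab-delimited BED3 line; malformed lines yield None.
fn parse_bed3(line: &str) -> Option<Bed3> {
    let mut fields = line.split('\t');
    Some(Bed3 {
        chr: fields.next()?.to_string(),
        start: fields.next()?.trim().parse().ok()?,
        end: fields.next()?.trim().parse().ok()?,
    })
}

/// Lazily merges overlapping intervals from a sorted stream.
struct StreamMerge<I: Iterator<Item = Bed3>> {
    input: I,
    pending: Option<Bed3>,
}

impl<I: Iterator<Item = Bed3>> StreamMerge<I> {
    fn new(input: I) -> Self {
        Self { input, pending: None }
    }
}

impl<I: Iterator<Item = Bed3>> Iterator for StreamMerge<I> {
    type Item = Bed3;

    fn next(&mut self) -> Option<Bed3> {
        loop {
            match (self.pending.take(), self.input.next()) {
                (None, None) => return None,
                (Some(cur), None) => return Some(cur),
                (None, Some(nxt)) => self.pending = Some(nxt),
                (Some(mut cur), Some(nxt)) => {
                    if nxt.chr == cur.chr && nxt.start <= cur.end {
                        // Overlapping or bookended: extend the pending interval.
                        cur.end = cur.end.max(nxt.end);
                        self.pending = Some(cur);
                    } else {
                        // Gap or new chromosome: emit and start a new run.
                        self.pending = Some(nxt);
                        return Some(cur);
                    }
                }
            }
        }
    }
}

/// Wire a line reader into the streaming merge, skipping unparsable lines.
fn merge_reader<R: BufRead>(reader: R) -> impl Iterator<Item = Bed3> {
    StreamMerge::new(
        reader
            .lines()
            .filter_map(|line| line.ok())
            .filter_map(|line| parse_bed3(&line)),
    )
}

fn main() {
    let data = "chr1\t10\t20\nchr1\t15\t30\nchr1\t40\t50\nchr2\t5\t8\n";
    for iv in merge_reader(Cursor::new(data)) {
        println!("{}\t{}\t{}", iv.chr, iv.start, iv.end);
    }
}
```

The trade-off is that this only works on pre-sorted input, which is exactly the kind of constraint a fully streamed design has to push onto the user.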

Dev Notes

For the Rust devs who are interested, I can say that I've absolutely fallen in love with both the declarative and procedural macro systems. They have shaved off thousands of lines of code in both this library and the bedrs library, and in my opinion, they are the perfect foil to generics.

I've found that with a fully generic library, like bedrs, you end up relying a lot on enums, which are great (don't get me wrong), but you can end up with match arms that look almost identical except for a single do_something_for_{} function, which varies by the enum variant.

// Something like this
match enum_class {
  MyEnum::A => do_something_for_a(),
  MyEnum::B => do_something_for_b(),
  MyEnum::C => do_something_for_c(),
}

This isn't a big deal when your enum only has a couple of variants, but as you scale up the number of variants, and you start adding combinations of them, the number of match arms skyrockets and it is a major PITA.

// Kill me now
match enum_class_a {
  MyEnum::A => match enum_class_b {
    MyEnum::A => do_something_for_aa(),
    MyEnum::B => do_something_for_ab(),
    MyEnum::C => do_something_for_ac(),
  },
  MyEnum::B => match enum_class_b {
    MyEnum::A => do_something_for_ba(),
    MyEnum::B => do_something_for_bb(),
    MyEnum::C => do_something_for_bc(),
  },
  MyEnum::C => match enum_class_b {
    MyEnum::A => do_something_for_ca(),
    MyEnum::B => do_something_for_cb(),
    MyEnum::C => do_something_for_cc(),
  },
}

Let's say you're crazy enough to do it, but then suddenly you change a function signature slightly, and boom, you have to make an edit to 500 different lines of code.

Macros are the way.

You can generate all those match arms, and their combinations, declaratively with only a few lines of code, and at compile time it just expands into that massive block of match statements. Then the shared functionality between all those arms can be modified in a single place!

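// Expands a list of (variant, function) pairs into a single match.
macro_rules! build_cases {
    ($my_enum:expr, $( ($fmt:ident, $method:ident) ),* $(,)?) => {
        match $my_enum {
            $(
                MyEnum::$fmt => $method(),
            )*
        }
    };
}

// Entry point: the variant-to-function table lives in exactly one place.
macro_rules! dispatch {
    ($my_enum:expr) => {{
        build_cases!(
            $my_enum,
            (A, do_something_for_a),
            (B, do_something_for_b),
            (C, do_something_for_c)
        )
    }};
}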

This is a simple example, and in this case you could just add a method to the enum, but there are many cases where you can't do that for static typing reasons, and this is where macros shine. You can play around with this example on the Rust Playground if you're interested in exploring.
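
For the combination case from earlier, here is a minimal sketch of how the same idea can expand into the full grid of nested match arms from a single variant table. Everything in it is hypothetical and written for this post: the marker types `FmtA`/`FmtB`/`FmtC`, the `Named` trait, the generic `do_something_for` function, and the macro names. It is not the actual dispatch code in gia, just one way such a thing could be written.

```rust
#[derive(Debug, Clone, Copy)]
enum MyEnum {
    A,
    B,
    C,
}

// Hypothetical marker types standing in for concrete record formats.
#[allow(dead_code)]
struct FmtA;
#[allow(dead_code)]
struct FmtB;
#[allow(dead_code)]
struct FmtC;

trait Named {
    const NAME: &'static str;
}
impl Named for FmtA {
    const NAME: &'static str = "a";
}
impl Named for FmtB {
    const NAME: &'static str = "b";
}
impl Named for FmtC {
    const NAME: &'static str = "c";
}

// The one generic function every arm calls; its signature changes in one place.
fn do_something_for<L: Named, R: Named>() -> String {
    format!("{}{}", L::NAME, R::NAME)
}

// Inner match: the left type is fixed, iterate over the right-hand variants.
macro_rules! match_rhs {
    ($rhs:expr, $func:ident, $lt:ty, [ $( ($rv:ident, $rt:ty) ),* ]) => {
        match $rhs {
            $( MyEnum::$rv => $func::<$lt, $rt>(), )*
        }
    };
}

// Outer match: iterate over the left-hand variants, forwarding the whole table.
macro_rules! match_lhs {
    ($lhs:expr, $rhs:expr, $func:ident, [ $( ($lv:ident, $lt:ty) ),* ], $all:tt) => {
        match $lhs {
            $( MyEnum::$lv => match_rhs!($rhs, $func, $lt, $all), )*
        }
    };
}

// Duplicates the table so it can be walked once per side.
macro_rules! cross {
    ($lhs:expr, $rhs:expr, $func:ident, $all:tt) => {
        match_lhs!($lhs, $rhs, $func, $all, $all)
    };
}

// Entry point: the variant-to-type table is written exactly once.
macro_rules! dispatch_pair {
    ($lhs:expr, $rhs:expr, $func:ident) => {
        cross!($lhs, $rhs, $func, [(A, FmtA), (B, FmtB), (C, FmtC)])
    };
}

fn main() {
    let (lhs, rhs) = (MyEnum::B, MyEnum::C);
    println!("{}", dispatch_pair!(lhs, rhs, do_something_for));
}
```

Adding a fourth variant here means touching the enum and one line of the table, and the sixteen arms fall out of the expansion.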

This was a game-changer for me and I loved it so much it became the core dispatch system in gia.

Conclusion

This release was a huge step forward for gia and I'm excited to see where it goes from here. Next up is a focus on streaming methods and a few more utilities that I think will be useful for the bioinformatics community.

Thanks for reading and happy coding!