[ `fxtools` ]

Introduction

fxtools is a collection of fastx utilities that I have written as the need arose. The utilities are written in Rust and focus on sequence transformations.

fastx is a term that means either FASTA or FASTQ. The utilities in this tool generally do not care which format the sequences are in, and will handle them appropriately.

All subcommands are documented here, but they should also be usable and understandable from their help menus.

fxtools has been written with the Unix philosophy in mind, and most subcommands can be used with standard CLI piping.

Contributions

Contributions are very welcome - if you have something in mind feel free to submit an issue, a PR, or reach out to me.

Issues

Please address all issues to future contributors

Installation

This first requires installing the Rust package manager, cargo:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Afterwards we can install fxtools using cargo:

cargo install fxtools

[ `fxtools cat` ]

Summary

This command will concatenate multiple fastx files into a single stream.

Expected Input

`sample_1.fa`

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT

`sample_2.fa`

>NSD1_a
ACTG
>NSD1_b
ACTT
>NSD2_a
CCCT

Expected Output

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT
>NSD1_a
ACTG
>NSD1_b
ACTT
>NSD2_a
CCCT

Usage

# standard concatenation
fxtools cat -i <fastx> <...> <fastx>

# into pipeline
fxtools cat -i <fastx_1> <fastx_2> <fastx_3> | fxtools filter -p "ACT"

# print only sequences of records
fxtools cat -s -i <fastx> <...> <fastx>

# print only sequences of records and write as single-line
fxtools cat -Ss -i <fastx> <...> <fastx>

# print only headers of records
# (assumes -H selects headers, as in `fxtools filter`; confirm with `fxtools cat --help`)
fxtools cat -H -i <fastx> <...> <fastx>

[ `fxtools clip` ]

Summary

This command will clip (or truncate) the records to only recover nucleotides within a desired range.

Expected Input

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT

Expected Output

# trims 1 nucleotide from the start and 1 nucleotide from the end
fxtools clip -s 1 -e 1

>AP2S1_a
CT
>AP2S1_b
CT
>AP2S2_a
CC

Usage

# left side clip (10 nucleotides)
fxtools clip -i <fastx> -s 10

# right side clip (10 nucleotides)
fxtools clip -i <fastx> -e 10

# left side clip (5 nucleotides) and right side clip (15 nucleotides)
fxtools clip -i <fastx> -s 5 -e 15

# clip everything outside of nucleotide range 10-20
fxtools clip -i <fastx> -r 10..20

# clip everything outside range 10-end (equivalent to -s 10)
fxtools clip -i <fastx> -r 10..

# clip everything outside range start-10
fxtools clip -i <fastx> -r ..10
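
The `-s`/`-e` behaviour shown in the expected output above can be sketched in Python as a plain slice. This is only an illustration, not the tool's Rust implementation: `clip` is a hypothetical helper, and the index conventions of `-r` are not modelled here.

```python
def clip(seq, start=0, end=0):
    """Drop `start` characters from the left and `end` from the right."""
    stop = len(seq) - end
    # An over-long clip leaves an empty sequence rather than wrapping around.
    return seq[start:stop] if stop > start else ""


records = [("AP2S1_a", "ACTG"), ("AP2S1_b", "ACTT"), ("AP2S2_a", "CCCT")]
clipped = [(name, clip(seq, start=1, end=1)) for name, seq in records]
for name, seq in clipped:
    print(f">{name}\n{seq}")
```

For FASTQ input the same slice would presumably be applied to the quality string so that it stays aligned with the sequence.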

[ `fxtools count` ]

Summary

This command will count the number of records in your input fastx.

Expected Input

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT

Expected Output

3

Usage

# standard counting
fxtools count -i <fastx>

# from pipeline
fxtools filter -i <fastx> -p "ACTCGCG" | fxtools count

[ `fxtools extract-variable` ]

Summary

This command will extract the variable regions from an input fastx and write those variable regions to the output fastx.

Expected Input Sequences

It was designed assuming that the sequences are all equal size and that they are prefixed and suffixed by a fairly static nucleotide region (consider CRISPRi/a libraries with a constant adapter sequence on either side of a highly variable region).

[prefix][variable][suffix]
[prefix][variable][suffix]
           ...
[prefix][variable][suffix]

Expected Output Sequences

The output sequences contain only the positions of the input sequences whose entropy is high relative to the other positions, i.e. the variable region.

[variable]
[variable]
   ...
[variable]

How it Works

This works by calculating the entropy of the nucleotides at each position across the sequences, then applying a z-score threshold to those entropies to determine a contiguous variable region. The bounds of that region are then used to write the output sequences.
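
A minimal Python sketch of this idea is shown here for illustration. It is not the tool's actual implementation: the function names are hypothetical, and the use of Shannon entropy in bits, the population standard deviation, and a region spanning the first to last passing position are all assumptions.

```python
import math
from collections import Counter


def positional_entropy(seqs):
    """Shannon entropy (bits) of the nucleotide distribution at each position."""
    entropies = []
    for column in zip(*seqs):  # assumes equal-length sequences
        total = len(column)
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in Counter(column).values())
        )
    return entropies


def variable_bounds(entropies, z_threshold=1.0):
    """Return (start, end) spanning every position whose z-score passes the threshold."""
    mean = sum(entropies) / len(entropies)
    std = math.sqrt(sum((e - mean) ** 2 for e in entropies) / len(entropies))
    if std == 0:
        return None  # all positions are equally variable, so nothing stands out
    passing = [i for i, e in enumerate(entropies) if (e - mean) / std >= z_threshold]
    if not passing:
        return None
    return passing[0], passing[-1] + 1


# Toy library: constant 5 nt prefix, 2 nt variable region, constant 5 nt suffix.
seqs = ["GGGGGACTTTTT", "GGGGGCATTTTT", "GGGGGGTTTTTT", "GGGGGTGTTTTT"]
bounds = variable_bounds(positional_entropy(seqs))
variable = [s[bounds[0]:bounds[1]] for s in seqs]
print(bounds, variable)
```

In this toy input only the two variable positions have non-zero entropy, so they are the only ones with a z-score above the threshold.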

Parameters

Default will write to stdout, but you can provide an output file with the -o flag. You can decide how many sequences to calculate the entropy on with the -n flag. You can decide what z-score threshold to use for your data with the -z flag.

Note:

The z-score threshold default is arbitrarily set. If you have a smaller number of sequences try to reduce the threshold to 0.5, and see if that helps.

Usage

fxtools extract-variable \
  -i <input_fastx> \
  -o <output_fastx> \
  -n <number of sequences to use in fitting entropy [default: 5000]> \
  -z <zscore threshold to use [default: 1.]>

[ `fxtools filter` ]

Summary

This command will filter your input fastx into an output fastx, keeping every record whose sequence or header matches your input pattern.

The pattern is a regular expression, and the match can be inverted with the -v flag (like grep -v).

Expected Input

The examples below are run against the following records.

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT

Expected Output

Filter on Sequence

fxtools filter -i <fasta> -p "ACT"

>AP2S1_a
ACTG
>AP2S1_b
ACTT

Filter on Header

fxtools filter -i <fasta> -p "_a" -H

>AP2S1_a
ACTG
>AP2S2_a
CCCT

Inverse Filter

fxtools filter -i <fasta> -p "ACT" -v

>AP2S2_a
CCCT

Usage

# standard filtering (on sequence)
fxtools filter -i <fastx> -p <pattern>

# filtering on header
fxtools filter -i <fastx> -p <pattern> -H

# inverse filter (removing all records that match pattern)
fxtools filter -i <fastx> -p <pattern> -v

# inverse filter (removing all records that match pattern) on header
fxtools filter -i <fastx> -p <pattern> -v -H
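
The three expected outputs above can be reproduced with a short Python sketch. `filter_records` is a hypothetical helper written for illustration; Python's `re` module stands in for whatever regex engine the tool uses, so pattern syntax may differ in the details.

```python
import re


def filter_records(records, pattern, on_header=False, invert=False):
    """Keep records whose sequence (or header) matches the regex `pattern`."""
    regex = re.compile(pattern)
    kept = []
    for header, seq in records:
        target = header if on_header else seq
        matched = regex.search(target) is not None
        if matched != invert:  # invert flips the decision, like grep -v
            kept.append((header, seq))
    return kept


records = [("AP2S1_a", "ACTG"), ("AP2S1_b", "ACTT"), ("AP2S2_a", "CCCT")]
print(filter_records(records, "ACT"))
print(filter_records(records, "_a", on_header=True))
print(filter_records(records, "ACT", invert=True))
```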

[ `fxtools fix` ]

Summary

This command will replace all nucleotides not matching: [ACTGNactgn] with the missing nucleotide N.

Parameters

Default will write to stdout, but you can provide an output file with the -o flag.

Usage

fxtools fix \
  -i <input_fastx> \
  -o <output_fastx>
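
The replacement rule from the summary can be written as a single regex substitution. This Python sketch is an illustration only, and `fix` is a hypothetical helper name.

```python
import re

# Anything outside the accepted alphabet from the summary is replaced.
NON_NUCLEOTIDE = re.compile(r"[^ACTGNactgn]")


def fix(seq):
    """Replace every character not in [ACTGNactgn] with N."""
    return NON_NUCLEOTIDE.sub("N", seq)


print(fix("ACTGRYacx-N"))
```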

[ `fxtools sort` ]

Summary

This tool can be used to sort a fastx file or a pair of equivalently sized fastx files based on sequence.

If an R2 is provided, then the default is to sort on its sequences.

Usage

# Sort a single fastq
fxtools sort -i <your_file.fastq>

# Sort a paired-end fastq set by R2
fxtools sort -i <your_R1.fq.gz> -I <your_R2.fq.gz>

# Sort a paired-end fastq set by R1
fxtools sort -i <your_R1.fq.gz> -I <your_R2.fq.gz> --sort-by-r1
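
The paired behaviour can be sketched in Python by sorting the two files together, so that mates stay in the same relative position. `sort_pairs` is a hypothetical helper, and plain lexicographic ordering of the sequences is an assumption about the sort order.

```python
def sort_pairs(r1, r2, sort_by_r1=False):
    """Sort paired (header, sequence) records by the R2 sequence, or by R1 on request."""
    key_index = 0 if sort_by_r1 else 1
    pairs = sorted(zip(r1, r2), key=lambda pair: pair[key_index][1])
    return [p[0] for p in pairs], [p[1] for p in pairs]


r1 = [("a/1", "TTTT"), ("b/1", "AAAA"), ("c/1", "CCCC")]
r2 = [("a/2", "GGGG"), ("b/2", "CCCC"), ("c/2", "AAAA")]
by_r2 = sort_pairs(r1, r2)
by_r1 = sort_pairs(r1, r2, sort_by_r1=True)
print(by_r2)
print(by_r1)
```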

[ `fxtools reverse` ]

Summary

This command will convert your input fastx into an output fastx with all nucleotide sequences reverse complemented (and any associated quality scores in reverse order).

Useful for grepping R2 for a sequence from R1, or vice versa.

Expected Input

Each sequence is reverse complemented; for FASTQ input, the quality scores are reversed to match.

ACTG
GCTA
AAAA

Expected Output

CAGT
TAGC
TTTT

Usage

fxtools reverse -i <your_fwd.fq.gz>
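
The expected output above follows from complementing each base and reversing the result, as in this Python sketch. The helper names are hypothetical, and mapping N to itself is an assumption about how ambiguous bases are handled.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_quality(qual):
    """Quality scores have no complement, so they are only reversed."""
    return qual[::-1]


for seq in ["ACTG", "GCTA", "AAAA"]:
    print(reverse_complement(seq))
```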

[ `fxtools sample` ]

Summary

This command will randomly downsample the number of records in your input fastx by some frequency f.

Expected Input

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT
>AP2S2_b
CCCC

Expected Output

With frequency 0.75.

>AP2S1_a
ACTG
>AP2S2_a
CCCT
>AP2S2_b
CCCC

Usage

# standard sampling 50% of records
fxtools sample -i <fastx> -f 0.5

# from pipeline subsampling 30% of records
fxtools filter -i <fastx> -p "ACTCGCG" | fxtools sample -f 0.3
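
One common way to downsample by a frequency is to keep each record independently with probability f, sketched here in Python. Whether the tool samples this way is an assumption, and `sample` with its `seed` argument is a hypothetical helper. Under this scheme the number of records kept varies from run to run, so three of four records at f = 0.75 is one possible outcome rather than a guarantee.

```python
import random


def sample(records, frequency, seed=None):
    """Keep each record independently with probability `frequency`."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < frequency]


records = [
    ("AP2S1_a", "ACTG"),
    ("AP2S1_b", "ACTT"),
    ("AP2S2_a", "CCCT"),
    ("AP2S2_b", "CCCC"),
]
print(sample(records, 0.75, seed=0))
```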

[ `fxtools sgrna-table` ]

Summary

This command will create a table mapping sgRNA names to their parent gene.

Expected Input

This works by parsing the header of each record and currently it expects the header to be as follows:

# {gene}_{auxiliary sgrna description}

Parameters

The command requires an input fasta/q file and will by default write a sgrna-to-gene table to stdout.

You can write the output table to a file with the -o flag.

You can also choose to include each record's sequence with the -s flag.

You can also reorder the columns with the -r flag by providing a 3-character string (e.g. -r hsg or -r ghs) representing the [hH]eader, [sS]equence, and [gG]ene.

By default the table's delimiter is tabs, but you can specify a separate delimiter with the -d flag.

Usage

fxtools sgrna-table \
  -i <input_fastx> \
  -o <s2g.txt> \
  -s \
  -r ghs \
  -d <character delim>

[ `fxtools t2g` ]

Summary

This tool can be used to parse a cDNA fasta file and build the transcript-to-gene mapping that is used by kallisto.

You can choose to report the Ensembl gene_id or the gene_name, which is the common symbol of that gene.

Usage

# parse the t2g and write to stdout
fxtools t2g -i <your_seq.cdna.fasta.gz>

# parse the t2g and write to `t2g.txt`
fxtools t2g -i <your_seq.cdna.fasta.gz> -o t2g.txt

# parse the t2g and write the symbols instead of the gene_id
fxtools t2g -i <your_seq.cdna.fasta.gz> -s

# parse the t2g and include the gene_id version in the output
fxtools t2g -i <your_seq.cdna.fasta.gz> -s -d
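
As an illustration of the parsing involved, this Python sketch pulls the transcript, gene ID, and gene symbol out of an Ensembl-style cDNA header. The header below is made up for the example, `t2g_row` is a hypothetical helper, and the `key:value` layout of the header fields is an assumption. Version handling (the -d flag) is not modelled.

```python
def t2g_row(header, use_symbol=False):
    """Map a cDNA FASTA header to a (transcript, gene) pair."""
    tokens = header.lstrip(">").split()
    transcript = tokens[0]
    # Remaining tokens are assumed to be `key:value` fields such as `gene:...`.
    fields = dict(tok.split(":", 1) for tok in tokens[1:] if ":" in tok)
    gene = fields["gene_symbol"] if use_symbol else fields["gene"]
    return transcript, gene


# Made-up header in the assumed layout, not a real record.
header = ">ENST00000000001.1 cdna gene:ENSG00000000001.1 gene_symbol:GENE1"
print(t2g_row(header))
print(t2g_row(header, use_symbol=True))
```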

[ `fxtools take` ]

Summary

This command will take a certain number of records from your input fastx.

Expected Input

>AP2S1_a
ACTG
>AP2S1_b
ACTT
>AP2S2_a
CCCT

Expected Output (take 2)

>AP2S1_a
ACTG
>AP2S1_b
ACTT

Usage

# standard take (taking 3 records from the top)
fxtools take -i <fastx> -n 3

# standard take (skipping 2 and then taking 3 records)
fxtools take -i <fastx> -s 2 -n 3

# from pipeline
# filtering for a sequence pattern, and then taking the first 30 hits
fxtools filter -i <fastx> -p "ACTCGCG" | fxtools take -n 30
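
Skipping and taking maps directly onto a lazy slice, sketched here in Python. `take` is a hypothetical helper, shown only to make the semantics of -s and -n concrete.

```python
from itertools import islice


def take(records, n, skip=0):
    """Skip `skip` records, then return up to `n` of the records that follow."""
    return list(islice(records, skip, skip + n))


records = [("AP2S1_a", "ACTG"), ("AP2S1_b", "ACTT"), ("AP2S2_a", "CCCT")]
print(take(records, 2))
print(take(records, 3, skip=2))
```

Because `islice` stops reading once enough records have been seen, the same approach works on a stream without loading the whole file.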

[ `fxtools trim` ]

Summary

This tool can be used to select and trim sequences that contain a specified adapter sequence.

This can almost be thought of as a combination of grep and sed, where everything before the grep match is removed.

By default the adapter sequence is kept (the adapter will be a common prefix for all kept reads) but it can also be trimmed away.

Usage

# trim away the sequence preceding the adapter
fxtools trim -i <your_seq.fq.gz> -a ACTTGGA

# trim away the sequence preceding the adapter, plus the adapter itself
fxtools trim -i <your_seq.fq.gz> -a ACTTGGA --trim-adapter
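
The select-and-trim behaviour can be sketched in Python as a substring search followed by a slice. `trim` is a hypothetical helper, and matching the first exact occurrence of the adapter is an assumption.

```python
def trim(seq, adapter, trim_adapter=False):
    """Return the trimmed sequence, or None when the adapter is absent."""
    pos = seq.find(adapter)
    if pos == -1:
        return None  # record is not selected
    start = pos + len(adapter) if trim_adapter else pos
    return seq[start:]


print(trim("GGGACTTGGATTT", "ACTTGGA"))
print(trim("GGGACTTGGATTT", "ACTTGGA", trim_adapter=True))
print(trim("GGGG", "ACTTGGA"))
```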

[ `fxtools unique` ]

fxtools unique \
  -i <input_fastx> \
  -o <optional_output_file_for_unique> \
  -n <optional_output_file_for_null>

[ `fxtools upper` ]

fxtools upper \
  -i <input_fastx> \
  -o <output_fastx>