Collapse

What it does

This tool collapses identical sequences in a FASTA file into a single sequence and records how many times each one appeared.


Example

Example Input File (Sequence "ATAT" appears multiple times):

>CSHL_2_FC0042AGLLOO_1_1_605_414
TGCG
>CSHL_2_FC0042AGLLOO_1_1_537_759
ATAT
>CSHL_2_FC0042AGLLOO_1_1_774_520
TGGC
>CSHL_2_FC0042AGLLOO_1_1_742_502
ATAT
>CSHL_2_FC0042AGLLOO_1_1_781_514
TGAG
>CSHL_2_FC0042AGLLOO_1_1_757_487
TTCA
>CSHL_2_FC0042AGLLOO_1_1_903_769
ATAT
>CSHL_2_FC0042AGLLOO_1_1_724_499
ATAT

Example Output File:

>1-1
TGCG
>2-4
ATAT
>3-1
TGGC
>4-1
TGAG
>5-1
TTCA

Original sequence names / lane descriptions (e.g. "CSHL_2_FC0042AGLLOO_1_1_742_502") are discarded.

The output sequence name is composed of two numbers separated by a hyphen: the first is the sequence's number in the output, the second is its multiplicity (the number of times the sequence appeared in the input).
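The collapsing and renaming described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function name collapse_fasta is hypothetical, and the sketch assumes that output sequences are numbered in order of first appearance, as in the example on this page.

```python
def collapse_fasta(lines):
    """Collapse identical sequences from an iterable of FASTA lines.

    Hypothetical helper for illustration. Assumes output order follows
    the first appearance of each distinct sequence in the input.
    """
    counts = {}      # sequence -> multiplicity (dicts keep insertion order)
    seq_parts = []   # pieces of the current, possibly multi-line, sequence

    def flush():
        # Record the sequence collected so far, if any.
        if seq_parts:
            seq = "".join(seq_parts)
            counts[seq] = counts.get(seq, 0) + 1
            seq_parts.clear()

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()  # original names are discarded
        else:
            seq_parts.append(line)
    flush()

    out = []
    for number, (seq, multiplicity) in enumerate(counts.items(), start=1):
        out.append(f">{number}-{multiplicity}")
        out.append(seq)
    return out


example_input = """\
>CSHL_2_FC0042AGLLOO_1_1_605_414
TGCG
>CSHL_2_FC0042AGLLOO_1_1_537_759
ATAT
>CSHL_2_FC0042AGLLOO_1_1_774_520
TGGC
>CSHL_2_FC0042AGLLOO_1_1_742_502
ATAT
>CSHL_2_FC0042AGLLOO_1_1_781_514
TGAG
>CSHL_2_FC0042AGLLOO_1_1_757_487
TTCA
>CSHL_2_FC0042AGLLOO_1_1_903_769
ATAT
>CSHL_2_FC0042AGLLOO_1_1_724_499
ATAT
""".splitlines()

collapsed = collapse_fasta(example_input)
print("\n".join(collapsed))
```

Run on the example input, the sketch produces the five records shown in the example output.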

The following output:

>2-4
ATAT

means that the sequence "ATAT" is the second distinct sequence in the file, and that it appeared 4 times in the input FASTA file.
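Downstream scripts often need the multiplicity back as a number. A minimal sketch of splitting a collapsed name into its two parts, assuming the "number-multiplicity" format described above (parse_collapsed_name is a hypothetical helper, not part of the tool):

```python
def parse_collapsed_name(header):
    """Split a collapsed FASTA header such as '>2-4' into (number, multiplicity).

    Hypothetical helper; assumes the 'number-multiplicity' naming shown above.
    """
    name = header.strip().lstrip(">")
    number, multiplicity = name.split("-")
    return int(number), int(multiplicity)


number, multiplicity = parse_collapsed_name(">2-4")
print(number, multiplicity)
```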