yotiao

Reputation: 273

Turning multi-fasta file into set of single-line sequences

I have a multi-fasta sequence file (there is a newline character at the end of each line):

>M3559
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

I would like to turn it into a file where each sequence is on a single line, with the sequence name followed by a tab:

>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

I got as far as removing all the newline characters with a simple:

awk 1 ORS='' test.txt

But I now need to place a newline character before each sequence name (that is, substitute > with \n>):

tr ">" "\n"

(although this removes the >, and ideally I would like to keep it, but it's not a big deal)

and add a \t after the sequence name, which I can capture with a regular expression.

^>M[0-9]{4}

And it is this last bit I struggle with: how do I add a character after a regex-matched string in a file? Suggestions will be greatly appreciated :-)

yot

UPDATE: below I paste the output of the various commands suggested by others on my test input file.

UPDATE 2: Fredrik's answer works if you use GNU sed instead of the default sed on a Mac. Please see my comment under Fredrik's answer.

Running:

awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file

on my input produces:

>M3559
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M9171
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA
>CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M4692
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG

Running:

echo $(cat test.txt) | sed 's/>/\n>/2g' | sed 's/ //2g' | sed 's/ /\t/g'

produces nothing (no output).

I am not running paste -d " " - - - - < input because the number of lines per sequence varies in my input.

But running:

awk 'NR%4{printf $0" ";next;}1' input

Produces:

>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG 
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG

and then running sed 's/ \+/\t/' | tr -d ' ' does not help...

Upvotes: 2

Views: 1511

Answers (3)

Svperstar

Reputation: 497

This might not be very elegant but I think it does what you want:

echo $(cat test.txt) | sed 's/>/\n>/2g' | sed 's/ //2g' | sed 's/ /\t/g'

Explained:

echo $(cat test.txt) - linearizes the file (every newline becomes a single space)

sed 's/>/\n>/2g' - places a '\n' before each '>' (from the 2nd occurrence on)

sed 's/ //2g' - deletes every space on each line except the first one

sed 's/ /\t/g' - replaces the only space left with a tab
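
A note on portability: the 2g flag and \n or \t in the replacement are GNU sed extensions, which may be why this pipeline prints nothing with the default sed on a Mac. Below is a minimal sketch of the same linearize-then-split idea using tr and awk instead. It assumes every sequence name is >M followed by digits, as in the question, and the function name linearize_fasta is made up for illustration.

```shell
# linearize_fasta is a hypothetical helper name, not part of any tool.
# Assumes headers look like >M1234 (">M" followed by digits only).
# Step 1 (tr): delete every newline so the file becomes one long line.
# Step 2 (awk): "&" stands for the matched header, so gsub puts a newline
# before it and a tab after it; sub then drops the very first newline.
linearize_fasta() {
    tr -d '\n' < "$1" |
    awk '{
        gsub(/>M[0-9]+/, "\n&\t")
        sub(/^\n/, "")
        print
    }'
}

# Example run on a small made-up file
tmp=$(mktemp)
printf '>M1234\nAC\nGT\n>M5678\nTT\n' > "$tmp"
linearize_fasta "$tmp"
rm -f "$tmp"
```

The "&" in the replacement string is also the general answer to "add a character after a regex match": it re-inserts whatever the pattern matched, and anything written around it is added before or after.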

Let me know if it worked!

Upvotes: 0

Fredrik Pihl

Reputation: 45662

If the input is as well formatted as above (a name line followed by exactly three sequence lines), you can use paste:

$ paste -d " " - - - - < input
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

or awk:

$ awk 'NR%4{printf $0" ";next;}1' input
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

To remove the spaces and put a tab after the id, pipe everything to (GNU sed):

sed 's/ \+/\t/' | tr -d ' '
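
Since \+ and \t are GNU sed extensions, here is a hedged sketch of the same clean-up step that should also work with a POSIX sed such as the default one on a Mac. It writes "one or more spaces" as a space followed by " *" and gets a literal tab from printf. The function name first_space_to_tab is made up for illustration.

```shell
# first_space_to_tab is a hypothetical helper name.
# Reads space-joined lines on stdin (as produced by paste -d " "),
# turns the first run of spaces into a tab, then deletes the other spaces.
first_space_to_tab() {
    tab=$(printf '\t')            # literal tab, since \t is not portable in sed
    sed "s/  */$tab/" | tr -d ' ' # "  *" means one or more spaces in POSIX BRE
}

# Example run on a made-up line
printf '>M1 AC GT\n' | first_space_to_tab
```

Without the g flag, sed only replaces the first match on each line, which is what keeps the tab after the id while tr removes the remaining spaces.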

Upvotes: 1

Casimir et Hippolyte

Reputation: 89557

You can do that with awk:

awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file

The idea is to set the input and output record separators to \n> and the field separator to \n. With these settings, the first field of each record is the sequence name and the remaining fields are the sequence lines. All that is left is to set the output field separator to the empty string, so the sequence lines are joined together, and to append a tab character to the first field.
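
One caveat: POSIX leaves the behaviour unspecified when RS holds more than one character, and an awk that only uses the first character would treat every line as a record, which would explain output like the one shown in the question's update. Below is a minimal sketch that avoids RS altogether and works line by line, so it also copes with a varying number of lines per sequence. The function name fasta_to_tsv is made up for illustration.

```shell
# fasta_to_tsv is a hypothetical helper name, not part of any tool.
# Prints each record as: header, tab, sequence joined onto one line.
# Reads the files given as arguments, or stdin when there are none.
fasta_to_tsv() {
    awk '
        /^>/ {                       # a header line starts a new record
            if (NR > 1) printf "\n"  # finish the previous record first
            printf "%s\t", $0        # header followed by a tab
            next
        }
        { printf "%s", $0 }          # sequence line: append, no newline
        END { if (NR > 0) printf "\n" }
    ' "$@"
}

# Example run on made-up input with records of different lengths
printf '>M1\nAC\nGT\n>M2\nTT\n' | fasta_to_tsv
```

It can be run either as fasta_to_tsv file or at the end of a pipeline.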

Upvotes: 0
