Reputation: 273
I have a multi-fasta sequence file (there is a newline character at the end of each line):
>M3559
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
I would like to turn it into a file where each sequence is in a single line, with sequence name followed by tab:
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
I got to the point where I removed all the newline characters by simple:
awk 1 ORS='' test.txt
But I now need to place a newline character in the beginning of each sequence name (so substitute > with \n>)
tr ">" "\n"
(although this removes the >, and ideally I would like to keep it, but it's not a big deal)
and add a \t after the sequence name, which I can capture with a regular expression.
^>M[0-9]{4}
And this is this last bit I struggle with - how do I add a character after a regex-ed string in a file? Suggestions will be greatly appreciated :-)
yot
UPDATE: below I paste the output of the various commands suggested by others on my test input file.
UPDATE 2: Fredrik's answer works if you use gnu sed instead of the default sed on a Mac. Please see my comment under Fredrik's answer.
Running:
awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file
on my input produces:
>M3559
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M9171
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA
>CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M4692
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
Running:
echo $(cat test.txt) | sed 's/>/\n>/2g' | sed 's/ //2g' | sed 's/ /\t/g'
produces nothing (no output).
I am not running paste -d " " - - - - < input
as numbers of line for each sequence is different in my input.
But running:
awk 'NR%4{printf $0" ";next;}1' input
Produces:
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
and then running sed 's/ \+/ /' | tr -d ' '
does not help...
Upvotes: 2
Views: 1511
Reputation: 497
This might not be very elegant but I think it does what you want:
echo $(cat test.txt) | sed 's/>/\n>/2g' | sed 's/ //2g' | sed 's/ /\t/g'
Explained:
echo $(cat test.txt)
will linearize the file
sed 's/>/\n>/2g'
- places a '\n'
before '>'
(from 2nd ocurrence on)
sed 's/ //2g'
- will delete the spaces after the first occurrence
sed 's/ /\t/g'
- replace the only space left for a tab
Let me know if it worked!
Upvotes: 0
Reputation: 45662
If the input is as well formated as above, you can use paste
$ paste -d " " - - - - < input
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
or awk
:
$ awk 'NR%4{printf $0" ";next;}1' input
>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
To remove spaces and to have a tab after the id, pipe everything to
sed 's/ \+/ /' | tr -d ' '
Upvotes: 1
Reputation: 89557
You can do that with awk:
awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file
The idea is to set the input and output records separator to \n>
and the fields separator to \n
. With this setting, the first field is the sequence name. All you need is to set the output fields separator to the empty string and to append a tab character at the end of this field.
Upvotes: 0