Amaranta_Remedios

Reputation: 773

How to add substring to some (not all) fasta headers

I have a fasta file that looks like this:

>miR-92|LQNS02278089.1_34108_3p  Parhyale hawaiensis 34108_3p 
AATTGCACTCGTCCCGGCCTGC
>miR-92|LQNS02278089.1_34106_3p  Parhyale hawaiensis 34106_3p 
AATTGCACTGATCCCGGCCTGC
>LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p 
CCGTAAGGCCGAAGACAAGAA
>LQNS02278094.1_35771_5p  Parhyale hawaiensis 35771_5p 
AAGAATAAGCCCGAGCAAGTCGAT

I want to change the headers to make them look like this:

>miR-92|LQNS02278089.1_34108_3p  Parhyale hawaiensis 34108_3p 
AATTGCACTCGTCCCGGCCTGC
>miR-92|LQNS02278089.1_34106_3p  Parhyale hawaiensis 34106_3p 
AATTGCACTGATCCCGGCCTGC
>miR-LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p 
CCGTAAGGCCGAAGACAAGAA
>miR-LQNS02278094.1_35771_5p  Parhyale hawaiensis 35771_5p 
AAGAATAAGCCCGAGCAAGTCGAT

Note that not all the headers changed, only the last two in the example, where the prefix "miR-" was added. So far I have been doing it like this:

perl -p -e "s/^>/>miR-/g" seq.fasta

But this adds miR- to every header, so IDs that already had it end up with it twice.

I know I could split the file, apply this only to the headers missing miR- at the beginning, and then merge the pieces again, but I would like an easier way to do it in one line without much manual intervention.

Upvotes: 1

Views: 98

Answers (3)

Carlos Pascual

Reputation: 1126

With awk you can select the headers that don't contain miR-:

awk '$0 !~ /miR-/ && $0 ~ /^>/'  file
>LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p
>LQNS02278094.1_35771_5p  Parhyale hawaiensis 35771_5p

and then add miR- only to those records:

awk '$0 !~ /miR-/ && $0 ~ /^>/ {gsub(/^>/, ">miR-")} 1' file
>miR-92|LQNS02278089.1_34108_3p  Parhyale hawaiensis 34108_3p
AATTGCACTCGTCCCGGCCTGC
>miR-92|LQNS02278089.1_34106_3p  Parhyale hawaiensis 34106_3p
AATTGCACTGATCCCGGCCTGC
>miR-LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p
CCGTAAGGCCGAAGACAAGAA
>miR-LQNS02278094.1_35771_5p  Parhyale hawaiensis 35771_5p
AAGAATAAGCCCGAGCAAGTCGAT
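
Note that the test above skips any header that contains miR- anywhere in the line, not only at the start. If that distinction matters for your data, a variant anchored to the start of the header might look like the sketch below. The `contig_7` header is a made-up example, not taken from the question, and `input` and `anchored` are just illustrative variable names.

```shell
# Sample records: the first and third headers come from the question;
# the second (contig_7) is hypothetical, with "miR-" later in the line.
input=$(printf '%s\n' \
  '>LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p' \
  'CCGTAAGGCCGAAGACAAGAA' \
  '>contig_7  similar to miR-92' \
  'AATTGCACTCGTCCCGGCCTGC' \
  '>miR-92|LQNS02278089.1_34108_3p  Parhyale hawaiensis 34108_3p' \
  'AATTGCACTCGTCCCGGCCTGC')

# Prefix only headers that do not already START with ">miR-".
# The trailing 1 prints every line, whether it was changed or not.
anchored=$(printf '%s\n' "$input" |
  awk '/^>/ && !/^>miR-/ { sub(/^>/, ">miR-") } 1')

printf '%s\n' "$anchored"
```

Here sub is enough instead of gsub, since an anchored pattern can match at most once per line.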

Upvotes: 0

tshiono

Reputation: 22012

You can also do it with sed:

sed -E "s/^>(miR-)?/>miR-/" seq.fasta
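
The optional group (miR-)? consumes an existing prefix if there is one, so the replacement always writes exactly one miR-, and running the command again changes nothing. A small sketch to check this on two records from the question (the variable names `input`, `once` and `twice` are only for illustration):

```shell
# Two records from the question: one already prefixed, one not.
input=$(printf '%s\n' \
  '>miR-92|LQNS02278089.1_34108_3p  Parhyale hawaiensis 34108_3p' \
  'AATTGCACTCGTCCCGGCCTGC' \
  '>LQNS02136402.1_14821_5p  Parhyale hawaiensis 14821_5p' \
  'CCGTAAGGCCGAAGACAAGAA')

# First pass adds the prefix where it is missing.
once=$(printf '%s\n' "$input" | sed -E 's/^>(miR-)?/>miR-/')

# Second pass should be a no-op, because (miR-)? swallows the prefix
# that the first pass added before it is written back.
twice=$(printf '%s\n' "$once" | sed -E 's/^>(miR-)?/>miR-/')

printf '%s\n' "$once"
[ "$once" = "$twice" ] && echo "second pass changed nothing"
```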

Upvotes: 4

P....

Reputation: 18371

You can use a negative lookahead to match only the lines that start with > and are not followed by miR-. Notice the single quotes: inside double quotes, an interactive bash shell may treat the ! as history expansion.

perl -p -e 's/^>(?!miR-)/>miR-/g' file

Upvotes: 7
