Panos

Reputation: 309

Add filename to FASTA headers in a loop with awk?

I know this has been asked before, but I cannot find a solution that works: for some reason, none of the other solutions posted on Stack Overflow work for me.

I have a directory with 900+ FASTA files, all ending in ".faa". Some of the names are:

TLLD001.faa TLLD002.faa TLLD003.faa TLLD004.faa TLLD005.faa

etc etc

Within each file, the FASTA headers look like:

   >scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

or

   >NODE_212
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >NODE_86667
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

etc etc

I want to go through all the files and prefix every header with the filename. For example, TLLD001.faa

   >scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
   >scaffold7667
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold6778
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

should become

   >TLLD001_scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >TLLD001_scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
   >TLLD001_scaffold7667
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >TLLD001_scaffold6778
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

This works nicely, but I have to specify a single file every time:

$ awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' TLLD001.faa

so it is not practical for 900+ files.

This seemed to work on the 3-4 files I used as a test, but on my 900+ file directory it takes forever:

for i in *.faa; do 
    sed -i "s/^>/>${i}_/g" *.faa
done

And the following do not work at all:

$ for file in *.fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"; done

and

$ for file in *.faa; do awk '/>/{sub(">","&"${file}"_");sub(/\.faa/,x)}1' < $file > "`basename $file .faa`_mod.faa"; done

and I don't know why! Any help, and an explanation of how to use this almighty but cryptic "awk", would be highly appreciated.

Thanks, P

Upvotes: 1

Views: 2713

Answers (4)

Mark Anthony

Reputation: 1

I know it's old, but on the OSX version of sed the -i option expects an extension. So you need to give '' as the argument to -i and pass the script with -e.

for f in *.faa; do sed -i '' -e "s/^>/>${f%.faa}_/g" "${f}"; done

For the OSX folks out there :)
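If the same loop has to run on both Linux and OSX, one option is to attach a backup suffix directly to -i, a form that both GNU sed and BSD sed accept. The following is only a sketch: the .bak suffix, the scratch directory, and the sample file are assumptions made for illustration.

```shell
# Work in a scratch directory so no real data is touched (illustrative setup).
workdir=$(mktemp -d)
printf '>scaffold4567\nWRVLSTSFNGIKYEQSAAFAMIPSTT\n>scaffold0034\nEQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ\n' > "$workdir/TLLD001.faa"

for f in "$workdir"/*.faa; do
    base=$(basename "$f" .faa)      # TLLD001.faa -> TLLD001
    # -i.bak (suffix attached, no space) is understood by GNU and BSD/OSX sed alike
    sed -i.bak -e "s/^>/>${base}_/" "$f"
done

rm -f "$workdir"/*.faa.bak          # remove the backups once the result looks right
cat "$workdir/TLLD001.faa"
```

The price is a set of .bak files that has to be cleaned up afterwards, which may or may not be acceptable with 900+ inputs.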

Upvotes: 0

Hielke Walinga

Reputation: 2845

The sed solution is the way to go, but you repeated the glob in the command!

Instead of

for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" *.faa; done

use the ${f} variable in the sed command. Otherwise the glob is expanded again for sed, and every pass of the loop edits every file!

for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" "${f}"; done

I also made use of bash parameter substitution to remove .faa from the filename.
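To see why repeating the glob is both slow and wrong, here is a small reproduction on two throwaway files. It is a sketch that assumes GNU sed (for -i without a suffix); the file names A.faa and B.faa and the helper make_samples are made up for the demonstration.

```shell
workdir=$(mktemp -d)

# Hypothetical helper: (re)create two tiny one-record FASTA files.
make_samples() {
    printf '>s1\nAAA\n' > "$workdir/A.faa"
    printf '>s1\nCCC\n' > "$workdir/B.faa"
}

# Glob repeated inside the loop: each pass rewrites ALL files.
make_samples
( cd "$workdir" && for f in *.faa; do sed -i "s/^>/>${f%.faa}_/" *.faa; done )
buggy_header=$(head -n 1 "$workdir/A.faa")
echo "glob repeated: $buggy_header"

# Loop variable passed to sed: each pass rewrites only its own file.
make_samples
( cd "$workdir" && for f in *.faa; do sed -i "s/^>/>${f%.faa}_/" "$f"; done )
fixed_header=$(head -n 1 "$workdir/A.faa")
echo "single file:   $fixed_header"
```

With the glob repeated, every header collects one prefix per file in the directory, and N files cost N x N in-place edits, which is consistent with the "takes forever" symptom on 900+ files.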

Upvotes: 2

karakfa

Reputation: 67467

This should do:

$ for f in *.faa; do sed -i "s/^>/>${f}_/" "$f"; done

However, this will insert the file extension as well. To remove the extension, change ${f} to ${f%.*}
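A quick sketch of what the suffix-stripping expansions do; the second file name, with an extra dot, is hypothetical and only there to show where the forms differ.

```shell
f="TLLD001.faa"
echo "${f%.faa}"    # remove the literal suffix .faa
echo "${f%.*}"      # remove the shortest trailing ".something"

g="sample.v2.faa"   # hypothetical name containing an extra dot
echo "${g%.*}"      # shortest match: only .faa is removed
echo "${g%%.*}"     # longest match: everything from the first dot is removed
```

For names like TLLD001.faa both ${f%.faa} and ${f%.*} give TLLD001, so either works here.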

Upvotes: 4

stack0114106

Reputation: 8711

Try a Perl one-liner.

perl -i -pe ' $x=$ARGV;$x=~s/\.faa$//; s/^>/>${x}_/ ' *.faa

With -p the substitution runs on every line, and $ARGV holds the name of the file currently being read.

Here is the breakdown.

$ cat TLLD001.faa
>scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

$ cat TLLD002.faa
>NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

Running the command without in-place editing (no -i):

$ perl -pe ' $x=$ARGV;$x=~s/\.faa$//; s/^>/>${x}_/ ' *.faa
>TLLD001_scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD001_scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD002_NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD002_NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

With in-place editing:

$ perl -i -pe ' $x=$ARGV;$x=~s/\.faa$//; s/^>/>${x}_/ ' *.faa

The files are modified:

$ cat TLLD001.faa
>TLLD001_scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD001_scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
$ cat TLLD002.faa
>TLLD002_NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD002_NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
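Since the question asked about awk specifically: awk's built-in FILENAME plays the same role as Perl's $ARGV, so a single awk process can walk all the files without any shell variable being passed in. (The awk attempt in the question put ${file} inside single quotes, where the shell does not expand it.) The sketch below assumes POSIX awk, which cannot edit in place, so it writes to new files; the _mod.faa output name and the scratch directory are illustrative choices, not part of the original answers.

```shell
workdir=$(mktemp -d)
printf '>scaffold4567\nWRVLSTSFNGIKYEQSAAFAMIPSTT\n>scaffold0034\nEQSAAFAMIPSTT\n' > "$workdir/TLLD001.faa"
printf '>NODE_212\nWRVLSTSFNGIKYEQ\n' > "$workdir/TLLD002.faa"

(
  cd "$workdir" || exit 1
  awk '
    FNR == 1 {                          # first line of each input file
        if (out != "") close(out)       # keep the number of open files small
        base = FILENAME
        sub(/\.faa$/, "", base)         # TLLD001.faa -> TLLD001
        out = base "_mod.faa"           # hypothetical output name
    }
    /^>/ { sub(/^>/, ">" base "_") }    # prefix header lines only
    { print > out }                     # sequence lines pass through unchanged
  ' *.faa
)
cat "$workdir/TLLD001_mod.faa"
```

The originals stay untouched, so the result can be inspected before the _mod.faa files replace them.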

Upvotes: 1
