Reputation: 309
I know this has been asked before, but none of the solutions posted on Stack Overflow work for me.
I have a directory with 900+ FASTA files, all ending in ".faa". Some of the names are:
TLLD001.faa TLLD002.faa TLLD003.faa TLLD004.faa TLLD005.faa
etc etc
Within each file, the FASTA headers look like:
>scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
or
>NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
etc etc
I want to go through all the files and prefix every header with the name of its file, minus the extension. For example, TLLD001.faa, which contains
>scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
should become
>TLLD001_scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD001_scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
This works nicely, but I have to specify a single file every time:
$ awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' TLLD001.faa
so it is not practical for 900+ files.
This seemed to work on the 3-4 files I used as a test, but it takes forever on my directory of 900+ files:
for i in *.faa; do
sed -i "s/^>/>${i}_/g" *.faa
done
The following do not work at all:
$ for file in *.fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"; done
and
$ for file in *.faa; do awk '/>/{sub(">","&"${file}"_");sub(/\.faa/,x)}1' < $file > "`basename $file .faa`_mod.faa"; done
and I don't know why! Any help, and an explanation of how to use the almighty but cryptic "awk", would be highly appreciated.
Thanks, P
Upvotes: 1
Views: 2713
Reputation: 1
I know this is old, but on the OS X (macOS) version of sed the -i option expects a backup extension. So give '' as the argument to -i and pass the expression with -e:
for f in *.faa; do sed -i '' -e "s/^>/>${f%.faa}_/g" "${f}"; done
For the OSX folks out there :)
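If the same script has to run on both Linux and OS X, one possible workaround is to attach a backup suffix directly to -i, a form that both GNU and BSD sed accept. The sketch below assumes a scratch directory and a made-up sample file; the variable name workdir is only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: in-place header prefixing that should work with GNU and BSD/macOS sed.
set -euo pipefail

workdir=$(mktemp -d)          # scratch directory, so no real data is touched
cd "$workdir"
printf '>scaffold4567\nWRVLSTSF\n>scaffold0034\nEQSAAFAM\n' > TLLD001.faa

for f in *.faa; do
    # -i.bak (suffix attached, no space) is understood by both sed flavours
    sed -i.bak -e "s/^>/>${f%.faa}_/" "$f"
    rm -f "$f.bak"            # remove the backup once the edit has succeeded
done

cat TLLD001.faa
```

The backup files end in .bak, so they are never matched by *.faa.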
Upvotes: 0
Reputation: 2845
The sed solution is the way to go, but you repeated the glob inside the command!
Instead of
for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" *.faa; done
use the ${f} variable in the sed command; otherwise the glob is expanded again and sed edits every file on every iteration:
for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" "${f}"; done
I also used bash parameter expansion to strip .faa from the filename.
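To see what repeating the glob does, here is a small sketch on two throwaway files (A.faa and B.faa are made-up names, and the directories wrong and right exist only for this demonstration). It uses -i.bak so that it runs with either GNU or BSD sed.

```shell
#!/usr/bin/env bash
set -euo pipefail

demo=$(mktemp -d)             # scratch area for the comparison
cd "$demo"
mkdir wrong right
for d in wrong right; do
    printf '>h1\nAAA\n' > "$d/A.faa"
    printf '>h1\nCCC\n' > "$d/B.faa"
done

# Wrong: the inner *.faa expands to every file on every iteration,
# so each file is edited once per file in the directory.
( cd wrong
  for f in *.faa; do sed -i.bak "s/^>/>${f%.faa}_/" *.faa; done
  rm -f ./*.bak )

# Right: only the file currently held in $f is edited.
( cd right
  for f in *.faa; do sed -i.bak "s/^>/>${f%.faa}_/" "$f"; done
  rm -f ./*.bak )

head -n 1 wrong/A.faa right/A.faa
```

In the wrong directory every header ends up with all the prefixes stacked (>B_A_h1 in both files), and with 900+ files that means 900+ passes over each file, which would explain why it seems to take forever.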
Upvotes: 2
Reputation: 67467
This should do:
$ for f in *.faa; do sed -i "s/^>/>${f}_/" "$f"; done
However, this inserts the file extension as well. To remove the extension, change ${f} to ${f%.*}.
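As a quick illustration of how these suffix-stripping expansions behave, the sketch below prints a few variants. The second filename, sample.v2.faa, is a hypothetical name chosen only to show the difference between % and %%.

```shell
#!/usr/bin/env bash
set -euo pipefail

f="TLLD001.faa"
strip_ext=${f%.faa}           # remove a literal .faa suffix
strip_any=${f%.*}             # remove the shortest suffix matching .*

g="sample.v2.faa"             # hypothetical name containing two dots
g_short=${g%.*}               # shortest match: only the last extension goes
g_long=${g%%.*}               # longest match: everything from the first dot goes

printf '%s\n' "$strip_ext" "$strip_any" "$g_short" "$g_long"
```

For names like TLLD001.faa both forms give TLLD001; they only differ when a filename contains more than one dot.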
Upvotes: 4
Reputation: 8711
Try a Perl one-liner:
perl -i -0777 -pe ' $x=$ARGV; $x=~s/\.faa$//; s/^>/>${x}_/mg ' *.faa
Here is the breakdown. -0777 reads each file in one go, so the substitution needs /m (so that ^ matches at the start of every line) and /g (so that every header is changed, not just the first one).
$ cat TLLD001.faa
>scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
$ cat TLLD002.faa
>NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
Running the command without in-place editing:
$ perl -0777 -pe ' $x=$ARGV; $x=~s/\.faa$//; s/^>/>${x}_/mg ' *.faa
>TLLD001_scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD001_scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD002_NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD002_NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
With in-place editing:
$ perl -i -0777 -pe ' $x=$ARGV; $x=~s/\.faa$//; s/^>/>${x}_/mg ' *.faa
The files are modified:
$ cat TLLD001.faa
>TLLD001_scaffold4567
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold0034
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
>TLLD001_scaffold7667
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD001_scaffold6778
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
$ cat TLLD002.faa
>TLLD002_NODE_212
WRVLSTSFNGIKYEQSAAFAMIPSTT
>TLLD002_NODE_86667
EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
$
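For completeness, the awk loop from the question can probably be repaired too. Its main problem is that ${file} sits inside single quotes, so the shell never expands it and awk receives it as literal text; in addition, FILENAME is empty when the input arrives through < redirection. A minimal sketch, assuming a scratch directory and a made-up sample file, passes the name in with -v instead (the awk variable name prefix and the _mod.faa output naming are just choices for this example):

```shell
#!/usr/bin/env bash
set -euo pipefail

scratch=$(mktemp -d)          # scratch directory, so no real data is touched
cd "$scratch"
printf '>NODE_212\nWRVLSTSF\n>NODE_86667\nEQSAAFAM\n' > TLLD002.faa

for file in *.faa; do
    base=${file%.faa}
    # -v copies the shell value into the awk variable "prefix";
    # the trailing 1 is an always-true pattern, so every line is printed.
    awk -v prefix="$base" '/^>/ { sub(/^>/, ">" prefix "_") } 1' "$file" > "${base}_mod.faa"
done

cat TLLD002_mod.faa
```

Because the results go to new *_mod.faa files, running the loop a second time in the same directory would also pick those up, so it may be safer to write the output into a separate directory.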
Upvotes: 1