Reputation: 135
I have 1,500 fasta files with many protein fragments in them. My goal is to separate these fragments into single files and to name these files something intuitive.
Here is an example of a fasta file that I have called plate9.H7.faa:
>39_fragment_4_295 (310978..311196) 1 None hypothetical protein
MQTATKQETYDRTMKVTLAVKANGGSVTVQIQAGDNWITTDTFWKDGGYQLSIPPATIRYVPAAGAAFEVYA*
>39_fragment_4_296 (311193..312437) 1 VOG01158 REFSEQ hypothetical protein
MSLLVNPIPRRQPIRRGLGLLGDSFSGNCHTIAATAFGTEAYGYAGWIAARTGLFPSYVDNQGKLGDHTGQFLARLPACIASSTADLWLLLSRTNDSTTAGMSLADTKANVMKIVTAFLNTPGKYLIIGTGTPRFGSRALTGQALADAIAYKDWVLSYVSQFVPVVNIWDGFTEAMTVEGLHPNLLGAEFISSRVVPIITANFEFPGIPLPTDAGDIYSAIRPFGCLNANPLLAGTGGTLPAGVNAAAGSVLADGYKAVGSGLTGITTRWFKEPAAYGEAQCIELRGNMAAAGGYIYMQPTANVVQTNLAAGDVIEMVSAVEIMGSSRGILAWEAELTITKTVSGAASTFYYRSMDKYQEPFTMPASFSGALETQRGTIDLTETVITSRMGLYLAAGVPQDSTVKAAQFGIRKV*
>56_fragment_9_667 (768674..769846) -1 K14059 int; integrase
MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
>56_fragment_9_668 (770054..770281) -1 PF02599.16 Global regulator protein family
MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
>56_fragment_9_669 (770485..770697) 1 None hypothetical protein
MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
>56_fragment_9_670 (770705..771487) -1 VOG00563 sp|Q05292|VG77_BPML5 Gene 77 protein
MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*
So far I have been able to split the files into many files with this command:
for x in *.faa; do csplit -z $x '/>/' '{*}'; done
And then rename them according to their fragment in the header:
for file in xx*; do mv "$file" `head -1 "$file" | cut -d$'\t' -f 1`_$x.fasta; done
And then rename each file to not have the '>' from each file, along with assigning it the original filename:
for i in *.fasta; do mv $i `echo $i | cut -c 2-`; done
My problem is that this only works on a single file at a time, since csplit leaves temporary files in the working directory (called xx00, xx01, xx02, xx03, and so on) and those would get mixed up between input files.
I feel like my solution would be to loop through each fasta file and do all of these for loops in succession before starting the next fasta file, and I feel like that would have to be a nested for loop which I have never done myself. Any guidance for what I could do would be appreciated.
Upvotes: 2
Views: 3473
Reputation: 15273
awk can print to outputs as defined in a variable.
Using your sample data above:
$: ls -l *.fasta
-rw-r--r-- 1 P2759474 1049089 1124 Jun 21 08:56 tmp.fasta
$: for f in *.fasta; do
     awk '/^>/ { sub(/^>/, "", $1); f=$1; next; }
               { print >> f; close(f); }' "$f"
   done
$: grep . 56_*
56_fragment_9_667:MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
56_fragment_9_668:MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
56_fragment_9_669:MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
56_fragment_9_670:MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*
Does that help? You could also run the awk processes in the background to handle the input files in parallel, or use parallel.
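A rough sketch of the background-job idea, in case it is useful. The function names split_one and split_all_parallel, the default of 4 concurrent jobs, and the <record-id>_<input-name>.fasta naming scheme are all my own assumptions, not something from the question. Unlike the loop above, this version keeps the header line in each output file, and it puts the input name into the output name so two background jobs never write to the same file.

```shell
# Hypothetical helper: split one FASTA file into one file per record.
# Output goes to the current directory as <record-id>_<input-name>.fasta
# (this naming scheme is an assumption).
split_one() {
    src=${1##*/}      # drop any leading directory
    src=${src%.faa}   # drop the .faa extension
    awk -v src="$src" '
        /^>/ {
            if (out != "") close(out)            # done with the previous record
            out = substr($1, 2) "_" src ".fasta" # strip ">" from the record id
        }
        out != "" { print > out }
    ' "$1"
}

# Hypothetical driver: run split_one on every *.faa file in the current
# directory, at most max_jobs at a time (default 4, an arbitrary choice).
split_all_parallel() {
    max_jobs=${1:-4}
    running=0
    for f in *.faa; do
        [ -e "$f" ] || continue   # glob matched nothing
        split_one "$f" &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                  # let the current batch finish
            running=0
        fi
    done
    wait                          # wait for the last partial batch
}
```

Calling split_all_parallel 8 in the directory holding the .faa files would then process them eight at a time. Waiting for a whole batch is cruder than a real job pool, but it only needs plain sh.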
Upvotes: 1
Reputation: 46846
You will improve performance by using a tool that doesn't require files to be opened and closed all the time. Awk is an excellent choice for this.
It seems to me that similar results to what you have written could be achieved with:
$ awk '/^>/ { file=substr($1,2) ".fasta" } { print > file }' *.faa
Note that unless you close() a file, awk leaves it open until the awk process is done, so the solution above will append to common fragment names, should they appear in multiple input files.
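If the number of distinct fragment names is large, some awk implementations may run out of open file handles, so it can be safer to close each output once the next header shows up. Here is a minimal sketch of that variant. The function name split_fasta is made up for illustration, and it switches to >> so that a fragment name seen again in a later input file is appended to rather than truncated after the close().

```shell
# Hypothetical wrapper around a single awk process that handles all the
# input files given as arguments. Each record goes to <record-id>.fasta
# in the current directory.
split_fasta() {
    awk '
        /^>/ {
            if (out != "") close(out)       # release the previous output file
            out = substr($1, 2) ".fasta"    # record id without the leading ">"
        }
        # ">>" appends, so a record id that recurs in a later input file
        # adds to the existing output instead of overwriting it.
        # Caveat: it also appends to files left over from an earlier run.
        out != "" { print >> out }
    ' "$@"
}
```

Usage would be split_fasta *.faa, run from an otherwise clean output directory.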
If you have a very large number of these (tens of thousands), then *.faa might expand to too many files for your shell to handle on one command line. If that's the case, you could process things more slowly using find.
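One possible shape for the find version, as a sketch only. The function name split_with_find and the <record-id>_<input-name>.fasta naming are assumptions on my part. The input name is taken from awk's FILENAME variable so that identical fragment names from different inputs end up in different output files, even when find splits the work across several awk invocations.

```shell
# Hypothetical wrapper: find every *.faa file under the given directory
# (default: current directory, searched recursively) and hand them to awk
# in batches small enough for one command line ("-exec ... {} +").
# Output files are written to the current directory.
split_with_find() {
    find "${1:-.}" -type f -name '*.faa' -exec awk '
        FNR == 1 {
            # Derive the input name: strip the directory and the .faa suffix.
            n = split(FILENAME, parts, "/")
            stem = parts[n]
            sub(/\.faa$/, "", stem)
        }
        /^>/ {
            if (out != "") close(out)
            out = substr($1, 2) "_" stem ".fasta"
        }
        out != "" { print > out }
    ' {} +
}
```

For example, split_with_find /path/to/plates would write files such as 39_fragment_4_295_plate9.H7.fasta into whatever directory you run it from (the path here is just a placeholder).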
Upvotes: 3