Taliamycota
Taliamycota

Reputation: 135

Split fasta files based on header

I have 1,500 fasta files with many protein fragments in them. My goal is to separate these fragments into single files and to name these files something intuitive.

Here is an example of a fasta file that I have called plate9.H7.faa:

>39_fragment_4_295  (310978..311196)    1   None    hypothetical protein
MQTATKQETYDRTMKVTLAVKANGGSVTVQIQAGDNWITTDTFWKDGGYQLSIPPATIRYVPAAGAAFEVYA*
>39_fragment_4_296  (311193..312437)    1   VOG01158    REFSEQ hypothetical protein
MSLLVNPIPRRQPIRRGLGLLGDSFSGNCHTIAATAFGTEAYGYAGWIAARTGLFPSYVDNQGKLGDHTGQFLARLPACIASSTADLWLLLSRTNDSTTAGMSLADTKANVMKIVTAFLNTPGKYLIIGTGTPRFGSRALTGQALADAIAYKDWVLSYVSQFVPVVNIWDGFTEAMTVEGLHPNLLGAEFISSRVVPIITANFEFPGIPLPTDAGDIYSAIRPFGCLNANPLLAGTGGTLPAGVNAAAGSVLADGYKAVGSGLTGITTRWFKEPAAYGEAQCIELRGNMAAAGGYIYMQPTANVVQTNLAAGDVIEMVSAVEIMGSSRGILAWEAELTITKTVSGAASTFYYRSMDKYQEPFTMPASFSGALETQRGTIDLTETVITSRMGLYLAAGVPQDSTVKAAQFGIRKV*
>56_fragment_9_667  (768674..769846)    -1  K14059  int; integrase
MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
>56_fragment_9_668  (770054..770281)    -1  PF02599.16  Global regulator protein family
MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
>56_fragment_9_669  (770485..770697)    1   None    hypothetical protein
MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
>56_fragment_9_670  (770705..771487)    -1  VOG00563    sp|Q05292|VG77_BPML5 Gene 77 protein
MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*

So far I have been able to split the files into many files with this command:

for x in *.faa; do csplit -z $x '/>/' '{*}'; done

And then rename them according to their fragment in the header:

for file in xx*; do mv "$file" `head -1 "$file" | cut -d$'\t' -f 1`_$x.fasta; done

And then rename each file to not have the '>' from each file, along with assigning it the original filename:

for i in *.fasta; do mv $i `echo $i | cut -c 2-`; done

My problem is that this works on a single file (since there are temporary files in the directory I am doing it in that are temporarily called xx00, xx01, xx02, xx03, and so on..

I feel like my solution would be to loop through each fasta file and do all of these for loops in succession before starting the next fasta file, and I feel like that would have to be a nested for loop which I have never done myself. Any guidance for what I could do would be appreciated.

Upvotes: 2

Views: 3473

Answers (2)

Paul Hodges
Paul Hodges

Reputation: 15273

awk can print to outputs as defined in a variable.
Using your sample data above:

$: ls -l *.fasta
-rw-r--r-- 1 P2759474 1049089 1124 Jun 21 08:56 tmp.fasta

$: for f in *.fasta; do 
     awk '/^>/ { sub(/^>/, "", $1); f=$1; next; } 
          { print >> f; close(f); }' "$f"
   done

$: grep . 56_*
56_fragment_9_667:MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
56_fragment_9_668:MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
56_fragment_9_669:MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
56_fragment_9_670:MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*

Does that help? You could also run the awk's in background to process them in parallel, or use parallel.

Upvotes: 1

ghoti
ghoti

Reputation: 46846

You will improve performance by using a tool that doesn't require files to be opened and closed all the time. Awk is an excellent choice for this.

It seems to me that similar results to what you have written could be achieved with:

$ awk '/^>/ { file=substr($1,2) ".fasta" } { print > file }' *.faa

Note that unless you close() a file, awk leaves it open until the awk process is done, so the solution above will append to common fragment names, should they appear in multiple input files.

If you have a very large number of these (tens of thousands), then *.faa might expand to too many files for your shell to handle on one command line. If that's the case, you could process things more slowly using find.

Upvotes: 3

Related Questions