mike

Reputation: 87

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:

and I am interested in one of the solutions provided by Pramod and jaypal singh:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

I cannot add comments yet, so I am asking here. If the input is

>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg

how does it produce three files:

chr22.fa  
chr1.fa  
chr14.fa

For example, chr22.fa contains:

>chr22
asdgasge
asegaseg

I understand the first part

/^>chr/ {OUT=substr($0,2) ".fa"};

and these commands:

/^>chr/  substr()  close() >>

But I don't understand how awk splits the input with the second part:

{print >> OUT; close(OUT)}

Could anyone explain this command in more detail? Thanks a lot!

Upvotes: 2

Views: 460

Answers (2)

kvantour

Reputation: 26481

The part you are asking about is a bit uncomfortable:

{ print $0 >> OUT; close(OUT) }

With this part, the awk program does the following for every line it processes:

  • Open the file OUT
  • Move the file pointer to the end of the file OUT
  • Append the line $0 followed by ORS to the file OUT
  • Close the file OUT
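To see why this splits the input, note that the second block has no pattern, so it runs for every line, while OUT keeps the name set by the most recent header line. A small trace, assuming the sample input from the question, prints the target file next to each line instead of writing anything (the " <- " separator and the trace variable are only for illustration):

```shell
# Trace which output file each input line would be appended to.
# OUT changes only on header lines; every line (headers included)
# is then sent to whatever OUT currently holds.
trace=$(printf '>chr22\nasdgasge\nasegaseg\n>chr1\naweharhaerh\nagse\n>chr14\ngasegaseg\n' |
  awk '/^>chr/ { OUT = substr($0, 2) ".fa" }
       { print OUT " <- " $0 }')
printf '%s\n' "$trace"
```

Each header line switches OUT, and all following lines land in that file until the next header appears.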

Why is this uncomfortable? Mainly because of the structure of your files. You should close a file only when you have finished writing to it, not after every write. As it stands, a FASTA record of 100 lines causes the file to be opened and closed 100 times.

A better approach would be:

awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
     {print > OUT }
     END {close(OUT)}' Input_File

Here we only open the file the first time we write to it and we close it when we don't need it anymore.

Note: the END block is not strictly needed, because awk closes any open files when it exits.
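If you want to try this out, here is a minimal sketch run in a scratch directory on the sample input from the question. The file name input.fa and the OUT guards (which skip any lines that come before the first header) are my own additions for the demo, not part of the command above:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input from the question (input.fa is a made-up name).
printf '>chr22\nasdgasge\nasegaseg\n>chr1\naweharhaerh\nagse\n>chr14\ngasegaseg\n' > input.fa

# On each header: close the previous output file (if any), then switch OUT.
# On every line: write to OUT, but only once a header has been seen.
awk '/^>chr/ { if (OUT) close(OUT); OUT = substr($0, 2) ".fasta" }
     OUT     { print > OUT }' input.fa

ls
cat chr22.fasta
```

Because each file stays open until the next header, using > here truncates a file only once, when it is first opened. If the same header can occur more than once in the input, >> would be the safer choice.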

Upvotes: 2

RavinderSingh13

Reputation: 133518

Please go through the following explanation and let me know if it helps.

awk '                             ##Start of the awk program.
/^>chr/{                          ##If the line starts with ">chr", do the following.
  OUT=substr($0,2) ".fa"          ##Set OUT to the line from its 2nd character to the end (dropping ">"), with ".fa" appended.
}
{
  print >> OUT                    ##Append the current line to the file named by OUT.
  close(OUT)                      ##Close that file again, so that awk does not hit a "too many open files" error.
}' Input_File                     ##Name of the input file.

You can also refer to the awk man page for the functions used here, for example:

   substr(s, i [, n])      Returns the at most n-character substring of s starting at i.  If n is omitted, the rest of s is used.
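As a quick check of how substr() builds the file name, here is a small sketch using the first header from the question (the variable names name and part are only for illustration):

```shell
# substr(s, 2) drops the leading ">" and keeps the rest of the line.
name=$(awk 'BEGIN { print substr(">chr22", 2) ".fa" }')
echo "$name"    # chr22.fa

# With the optional third argument, at most n characters are returned.
part=$(awk 'BEGIN { print substr("asdgasge", 2, 3) }')
echo "$part"    # sdg
```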

Upvotes: 2
