Use awk to split one file into several small files by pattern

Question

I have read this post about using awk to split one file into several files:

and I am interested in one of the solutions provided by Pramod and jaypal singh:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

I cannot add comments yet, so I am asking here. If the input is

>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg

how does the command produce three files:

chr22.fa  
chr1.fa  
chr14.fa

For example, chr22.fa contains:

>chr22
asdgasge
asegaseg

I understand the first part

/^>chr/ {OUT=substr($0,2) ".fa"};

and these individual pieces:

/^>chr/  substr()  close() >>

But I don't understand how awk splits the input with the second part:

{print >> OUT; close(OUT)}

Could anyone explain this command in more detail? Thanks a lot!

RavinderSingh13 · Accepted Answer

Could you please go through the following and let me know if it helps.

awk '                             ##Start of the awk program.
/^>chr/{                          ##If the line starts with >chr (a header line), do the following.
  OUT=substr($0,2) ".fa"          ##Set OUT to the line without its leading > (2nd character to the end), with .fa appended.
}
{
  print >> OUT                    ##No pattern here, so this runs for every line: append the current line to the file named by OUT.
  close(OUT)                      ##Close that file again so awk never holds many files open at once; this avoids the "too many open files" error.
}' Input_File                     ##Name of the input file.
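
To see why the second block does the splitting, it may help to trace which file each line would go to. The second block has no pattern, so it runs for every line, while OUT only changes when a header line is read; every line in between therefore lands in the file named by the most recent header. The sketch below assumes the sample input from the question is saved as sample.txt (a hypothetical file name), and it prints the target file name next to each line instead of writing to it:

```shell
# Recreate the sample input from the question (sample.txt is a hypothetical name).
printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' \
              '>chr1' 'aweharhaerh' 'agse' \
              '>chr14' 'gasegaseg' > sample.txt

# Same logic as the command above, but print "file <- line" instead of writing.
awk '
/^>chr/ { OUT = substr($0, 2) ".fa" }   # runs only on header lines
        { print OUT " <- " $0 }         # runs on every line; OUT keeps its last value
' sample.txt
# chr22.fa <- >chr22
# chr22.fa <- asdgasge
# chr22.fa <- asegaseg
# chr1.fa <- >chr1
# chr1.fa <- aweharhaerh
# chr1.fa <- agse
# chr14.fa <- >chr14
# chr14.fa <- gasegaseg
```

The append operator >> matters in the real command because the file is closed after every line: with a plain >, each reopen would truncate the file, leaving only the last line of each record. For the same reason, running the command twice appends to the files left over from the first run.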

You can also refer to the awk man page for the functions used here, for example:

   substr(s, i [, n])      Returns the at most n-character substring of s starting at i.  If n is omitted, the rest of s is used.
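
As a quick illustration of that definition, here is a small sketch applied to one header line from the question. The echo input is just a stand-in for a line of the real file:

```shell
# substr($0, 2) drops the first character (the leading >) and keeps the rest.
echo '>chr22' | awk '{ print substr($0, 2) }'         # prints chr22

# Concatenating ".fa" gives the output file name used in the command.
echo '>chr22' | awk '{ print substr($0, 2) ".fa" }'   # prints chr22.fa

# With the optional third argument, at most n characters are returned.
echo '>chr22' | awk '{ print substr($0, 2, 3) }'      # prints chr
```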
