user3132983

Reputation: 65

Split a file into multiple files based on the appearance of a name in the first column

How can I split file.txt into subfiles, where each subfile holds one continuous run of lines starting with XX? That is, print lines that start with XX into file1.txt; when the next line does not start with XX, close file1.txt and open file2.txt for the next run of XX lines.

Input file: file.txt

some header information
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg 
XX 456 thu gtr 
PAGE2
XX 345 dcf try

Desired output:

file1.txt

XX 123 456 abc
XX 234 567 def
XX 456 345 ghi

file2.txt

XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb

file3.txt

XX 456 tfg tyg 
XX 456 thu gtr

file4.txt

XX 345 dcf try

Upvotes: 2

Views: 744

Answers (3)

RavinderSingh13

Reputation: 133670

The following awk command may also help.

awk '!/^XX/{delete a[XX];if("file"val".txt"){close("file"val".txt")};next} FNR>1 && !a[XX]++{val++} /^XX/{print > "file"val".txt"}' Input_file

It will create 4 output files, as follows.

-rw-rw-r--  1 singh singh   15 Jan 17 22:29 file4.txt
-rw-rw-r--  1 singh singh   32 Jan 17 22:29 file3.txt
-rw-rw-r--  1 singh singh   60 Jan 17 22:29 file2.txt
-rw-rw-r--  1 singh singh   45 Jan 17 22:29 file1.txt

The above was written and tested in GNU awk. In case you don't have GNU awk, you could change delete a[XX] to for(num in a){delete a[num]}.

EDIT: Adding a non-one-liner form of the solution along with an explanation.

awk '
!/^XX/{                    ## If the line does NOT start with XX:
  delete a[XX];            ## XX is an unset awk variable here, so this deletes a[""]. It resets the marker so that val is incremented only once per run of XX lines.
  if("file"val".txt"){     ## This string is never empty, so the test is always true. The aim is to close the previous output file so that too many files are not left open.
    close("file"val".txt") ## Close the file named "file"val".txt".
  }
  next                     ## next skips the remaining rules and moves on to the next input line.
}
FNR>1 && !a[XX]++{         ## If the line number is greater than 1 and a[XX] is not yet set (the first XX line of a run):
  val++;                   ## Increment the file counter val by 1.
}
/^XX/{                     ## If the line starts with XX:
  print > "file"val".txt"  ## Write the current line to the output file "file"val".txt", i.e. file1.txt, file2.txt and so on.
}
'  Input_file              ## The input file name.
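
One edge case worth checking in the script above: because of the FNR>1 test, an input whose very first line already starts with XX would have that line written to file.txt, since val is still empty at that point. The following is only a sketch of one way around it, replacing the array a with a plain flag. The names started and out, the part prefix, and the sample data are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical sample input whose first line already starts with XX.
printf '%s\n' 'XX 1 a' 'XX 2 b' 'END' 'PAGE2' 'XX 3 c' > sample_a.txt

awk '
!/^XX/ {                        # separator line: the current run is over
    if (out != "") close(out)   # close the previous output file, if there is one
    started = 0
    next
}
!started {                      # first XX line of a new run
    started = 1
    val++
    out = "part" val ".txt"     # build the name once, outside the redirection
}
{ print > out }                 # only XX lines reach this rule
' sample_a.txt
```

Building the file name in a variable first also sidesteps the unparenthesized concatenation after >, which not every awk implementation parses the same way.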

Upvotes: 1

Akshay Hegde

Reputation: 16997

Using awk, one-liner:

$ awk '!/^XX/{if(f)close(f);f=sprintf("file%d.txt",++n);next}{print >f}' infile

Explanation:

awk '!/^XX/{                          # if line/record/row does not start with XX
         if(f)                        # if variable f was set before
            close(f);                 # close file 
         f=sprintf("file%d.txt",++n); # pre increment variable n, generate new file name
         next                         # go to next line
      }
      {
         print >f                     # records starting with XX are
                                      # written to the file named in variable f
      }
     ' infile

Test Results:

Input:

$ cat infile
some header information
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg 
XX 456 thu gtr 
PAGE2
XX 345 dcf try

Output:

$ cat file1.txt 
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi

$ cat file2.txt 
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb

$ cat file3.txt 
XX 456 tfg tyg 
XX 456 thu gtr 

$ cat file4.txt 
XX 345 dcf try

In response to this comment:

If there are too many lines in the header information of the input file, the output file names start with a bigger number. How can I start the output files from output1.out and so on?

awk '/^XX/{if(!w)f=sprintf("file%d.txt",++n);w=1;print >f;next}{close(f);w=0}' infile

Explanation:

awk '/^XX/{                             # if line starts with XX
        if(!w)                          # if w is not set (not already inside a run of XX lines)
           f=sprintf("file%d.txt",++n); # pre-increment n and set up variable f
        w=1;                            # set variable w = 1
        print >f;                       # write record/row/line to file
        next                            # go to next line
     }
     {                                  # for lines that do not start with XX
        close(f);                       # close file
        w=0                             # set w = 0 (so that the next line with XX uses a new file)
     }
    ' infile
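
The comment asks for names like output1.out, while the command above writes file1.txt, file2.txt and so on. As a minimal sketch, assuming the prefix and extension should be configurable, they can be passed in with -v. The variable names prefix and ext and the sample data are assumptions added here for illustration.

```shell
# Hypothetical sample input with several header lines.
printf '%s\n' 'header 1' 'header 2' 'XX 1 a' 'END' 'XX 2 b' 'XX 3 c' > sample_b.txt

awk -v prefix="output" -v ext=".out" '
/^XX/ {
    if (!w) f = sprintf("%s%d%s", prefix, ++n, ext)  # new run: build the next file name
    w = 1
    print > f
    next
}
w { close(f); w = 0 }   # the first non-XX line after a run closes the file
' sample_b.txt
```

With this input the XX lines end up in output1.out and output2.out, regardless of how many header lines come first.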

Test Results - for comment :

Input-modified :

$ cat infile 
some header information
some header2
some header 3
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
some more extra
wxxasa
extrasa
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg 
XX 456 thu gtr 
PAGE2
XX 345 dcf try

Execution:

$ awk '/^XX/{if(!w)f=sprintf("file%d.txt",++n); w=1;  print >f;next}{close(f); w=0}' infile 

Files generated:

$ ls *.txt -1
file1.txt
file2.txt
file3.txt
file4.txt

Contents of each file:

$ for i in *.txt; do echo "File: $i"; cat $i; done
File: file1.txt
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
File: file2.txt
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
File: file3.txt
XX 456 tfg tyg 
XX 456 thu gtr 
File: file4.txt
XX 345 dcf try

Upvotes: 4

Allan

Reputation: 12448

You can use the following awk command to process the file and generate output files:

awk 'BEGIN{file_cmpt=1}{if($1=="XX"){print $0 > "output"file_cmpt".out";}else{file_cmpt++}}' input.txt

Explanation:

  • BEGIN{file_cmpt=1} initializes the file counter to 1
  • {if($1=="XX"){print $0 > "output"file_cmpt".out";} if the first column is XX, the line is printed to the current file
  • else{file_cmpt++} if the first field of a line is anything other than XX, the counter is incremented, so the following XX lines go to a new file

Test on your input file:

[screenshot of the test run]

If you have some header information at the top of your input file, you can use the following awk command:

awk 'BEGIN{file_cmpt=1;in_header=1}{if($1=="XX"){print $0 > "output"file_cmpt".out";in_header=0}else{if(!in_header){file_cmpt++}}}' input.txt

The only change is that I added an in_header test before incrementing the file counter. We assume that we start in the header part of the file. As soon as we meet the first occurrence of XX, we set in_header to 0, and from then on the counter is incremented whenever we reach a line whose first field is not equal to XX.

Tested on the following input:

[screenshot of the test input]
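
Two points about the commands in this answer may be worth guarding against: the counter is incremented on every non-XX line, so several separator lines in a row make the file numbers skip, and finished output files are never closed, which can matter on inputs with many runs. A possible sketch that keeps the $1=="XX" test and the file_cmpt counter but adds a run flag and close() is shown here. The names in_run and out, the chunk prefix, and the sample data are hypothetical.

```shell
# Hypothetical sample input with a header and two separator lines in a row.
printf '%s\n' 'some header' 'XX 1 a' 'XX 2 b' 'END' 'extra' 'XX 3 c' > sample_c.txt

awk '
$1 == "XX" {
    if (!in_run) {                           # first XX line of a run
        in_run = 1
        out = "chunk" (++file_cmpt) ".out"   # count runs, not separator lines
    }
    print $0 > out
    next
}
in_run { close(out); in_run = 0 }            # run ended: release the file handle
' sample_c.txt
```

Note that $1 == "XX" matches only when the whole first field is XX, whereas the /^XX/ pattern used in the other answers also matches a first field such as XXY.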

Upvotes: 0
