Reputation: 65
How can I split file.txt into subfiles, where each subfile holds one continuous run of lines starting with XX in file.txt? For example, print the lines that start with XX into file1.txt, and when the next line does not start with XX, close file1.txt and open file2.txt for the next run of XX lines.
Input file: file.txt
some header information
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg
XX 456 thu gtr
PAGE2
XX 345 dcf try
Desired output:
file1.txt
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
file2.txt
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
file3.txt
XX 456 tfg tyg
XX 456 thu gtr
file4.txt
XX 345 dcf try
Upvotes: 2
Views: 744
Reputation: 133670
The following awk command may also help you with this.
awk '!/^XX/{delete a[XX];if("file"val".txt"){close("file"val".txt")};next} FNR>1 && !a[XX]++{val++} /^XX/{print > "file"val".txt"}' Input_file
It will create 4 output files, as follows.
-rw-rw-r-- 1 singh singh 15 Jan 17 22:29 file4.txt
-rw-rw-r-- 1 singh singh 32 Jan 17 22:29 file3.txt
-rw-rw-r-- 1 singh singh 60 Jan 17 22:29 file2.txt
-rw-rw-r-- 1 singh singh 45 Jan 17 22:29 file1.txt
The above was written and tested in GNU awk. If you don't have GNU awk, you could also change delete a[XX] to for(num in a){delete a[num]}.
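For reference, here is a minimal sketch of that variant with the for loop, run against the sample data from the question. A few things in it are my assumptions rather than part of the tested command above: the scratch file name sample.txt, the simpler if (val) guard before close, and the parentheses around the output file name, which sidestep an ambiguity that some non-GNU awks have with string concatenation after ">".

```shell
# Recreate the sample input from the question (sample.txt is a scratch name).
cat > sample.txt <<'EOF'
some header information
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg
XX 456 thu gtr
PAGE2
XX 345 dcf try
EOF

awk '
!/^XX/ {
  for (num in a) delete a[num]          # empty the array one element at a time
  if (val) close("file" val ".txt")     # close the finished file, if there is one
  next
}
FNR>1 && !a[XX]++ { val++ }             # first XX line of a run: next file number
/^XX/ { print > ("file" val ".txt") }   # parentheses keep the file name unambiguous
' sample.txt
```

With this input it should write file1.txt through file4.txt, matching the desired output in the question.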
EDIT: Adding a non-one-liner form of the solution, along with an explanation.
awk '
!/^XX/{ ##If the line does NOT start with the string XX, do the following:
delete a[XX]; ##Delete the element of array a at index XX (XX is an unset variable here, so the index is the empty string). This makes sure val is increased only once per run of XX lines, not on every XX line.
if("file"val".txt"){ ##This string is never empty, so the test is always true and close() is always called; closing finished files avoids the problem of too many open files.
close("file"val".txt") ##Close the file named "file"val".txt".
}
next ##next is a built-in awk keyword: skip all further statements for this line and start again from the top with the next line.
}
FNR>1 && !a[XX]++{ ##If the line number is greater than 1 and a[XX] is not set yet (the first XX line of a run), do the following:
val++; ##Increase val by 1; it is the number used in the output file name.
}
/^XX/{ ##If the line starts with the string XX, do the following:
print > "file"val".txt" ##Print the current line into the output file "file"val".txt", that is file1.txt, file2.txt, and so on.
}
' Input_file ##Input_file is the name of the input file.
Upvotes: 1
Reputation: 16997
Using awk, as a one-liner:
$ awk '!/^XX/{if(f)close(f);f=sprintf("file%d.txt",++n);next}{print >f}' infile
Explanation:
awk '!/^XX/{ # if line/record/row does not start with XX
if(f) # if variable f was set before
close(f); # close file
f=sprintf("file%d.txt",++n); # pre increment variable n, generate new file name
next # go to next line
}
{
print >f # records that start with XX are
# written to the file named in variable f
}
' infile
Test Results:
Input:
$ cat infile
some header information
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg
XX 456 thu gtr
PAGE2
XX 345 dcf try
Output:
$ cat file1.txt
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
$ cat file2.txt
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
$ cat file3.txt
XX 456 tfg tyg
XX 456 thu gtr
$ cat file4.txt
XX 345 dcf try
For the comment:
If there are too many lines in the header information of the input file, the output file name starts with a bigger number. How can I start the output file from output1.out and so on?
awk '/^XX/{if(!w)f=sprintf("file%d.txt",++n);w=1;print >f;next}{close(f);w=0}' infile
Explanation:
awk '/^XX/{ # if line starts with XX
if(!w) # if w is not set, i.e. a new run of XX lines begins
f=sprintf("file%d.txt",++n); # pre increment n, and set up variable f
w=1; # set variable w = 1
print >f; # write record/row/line to file
next # go to next line
}
{ # for lines that do not start with XX
close(f); # close file
w=0 # set w = 0, so that the next XX line starts a new file
}
' infile
Test Results - for comment :
Input-modified :
$ cat infile
some header information
some header2
some header 3
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
END
some more extra
wxxasa
extrasa
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
YY 123 dft fty
XX 456 tfg tyg
XX 456 thu gtr
PAGE2
XX 345 dcf try
Execution:
$ awk '/^XX/{if(!w)f=sprintf("file%d.txt",++n); w=1; print >f;next}{close(f); w=0}' infile
Files generated:
$ ls *.txt -1
file1.txt
file2.txt
file3.txt
file4.txt
Contents of each file:
$ for i in *.txt; do echo "File: $i"; cat "$i"; done
File: file1.txt
XX 123 456 abc
XX 234 567 def
XX 456 345 ghi
File: file2.txt
XX 345 654 ijk
XX 567 789 klm
XX 678 asd mno
XX 567 thy mnb
File: file3.txt
XX 456 tfg tyg
XX 456 thu gtr
File: file4.txt
XX 345 dcf try
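If the output names need to differ, for example output1.out as in the comment, one option is to pass the name parts in with -v. This is only a sketch: the variable names prefix and ext and the scratch file demo_in.txt are made up for illustration, and close is guarded here so that it runs only once, right after a run of XX lines has ended.

```shell
# Small scratch input with a multi-line header and consecutive non-XX lines.
cat > demo_in.txt <<'EOF'
some header information
some header2
XX 123 456 abc
XX 234 567 def
END
some more extra
XX 345 654 ijk
YY 123 dft fty
XX 456 tfg tyg
EOF

awk -v prefix=output -v ext=out '
/^XX/ {
  if (!w) f = sprintf("%s%d.%s", prefix, ++n, ext)  # first line of a run: build next name
  w = 1
  print > f
  next
}
w { close(f); w = 0 }   # a run just ended: close its file once
' demo_in.txt
```

With this input it should produce output1.out, output2.out and output3.out, numbered from 1 regardless of how many header lines there are.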
Upvotes: 4
Reputation: 12448
You can use the following awk command to process the file and generate output files:
awk 'BEGIN{file_cmpt=1}{if($1=="XX"){print $0 > "output"file_cmpt".out";}else{file_cmpt++}}' input.txt
Explanation:
BEGIN{file_cmpt=1}
initializes the file counter to 1.
{if($1=="XX"){print $0 > "output"file_cmpt".out";}
if the first field is XX, the line is printed to the current file.
else{file_cmpt++}
if the first field of a line is different from XX, the counter is increased, so the following XX lines are printed to another file.
Test on your input file:
If you have some header information at the top of your input file, you can use the following awk command:
awk 'BEGIN{file_cmpt=1;in_header=1}{if($1=="XX"){print $0 > "output"file_cmpt".out";in_header=0}else{if(!in_header){file_cmpt++}}}' input.txt
The only change is that I added an in_header test before incrementing the file counter. We assume that we start in the header part of the file; as soon as we meet the first occurrence of XX we set in_header to 0, and from then on the counter is incremented whenever we reach a line whose first field is not equal to XX.
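One caveat, offered as a sketch rather than a tested replacement: the counter above still advances on every non-XX line after the header, so two such lines in a row would skip a file number. A variant that only counts when a new run of XX lines begins could look like this. The in_block flag and the scratch file input_demo.txt are names I made up; the output names follow the command above.

```shell
# Scratch input with two consecutive non-XX lines (END, some more extra).
cat > input_demo.txt <<'EOF'
some header information
XX 123 456 abc
XX 234 567 def
END
some more extra
XX 345 654 ijk
YY 123 dft fty
XX 456 tfg tyg
PAGE2
XX 345 dcf try
EOF

awk '{
  if ($1 == "XX") {
    if (!in_block) { file_cmpt++; in_block = 1 }    # a new run of XX lines starts
    print $0 > ("output" file_cmpt ".out")
  } else {
    if (in_block) close("output" file_cmpt ".out")  # the run just ended
    in_block = 0
  }
}' input_demo.txt
```

Because the counter moves only on the first XX line of each run, the files should come out as output1.out to output4.out with no gaps, and no BEGIN block or header flag is needed.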
tested on the following input:
Upvotes: 0