Reputation: 53
I want to split a text file into multiple files based on a matched regex. This is straightforward using awk. For instance:
tmp_file_prefix="f-" ; awk '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} {print > x;}' file.txt
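On a toy input this puts each "ID:" record (the header line plus its body lines) into its own file; the file contents below are invented for illustration:

```shell
# Toy input: two records, each an "ID:" header plus one body line.
printf 'ID: a\nline 1\nID: b\nline 2\n' > file.txt
tmp_file_prefix="f-"
awk '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} {print > x;}' file.txt
head -n 1 f-1    # ID: a
head -n 1 f-2    # ID: b
```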
The catch is that the input file "file.txt" is huge: 2.6 GB, to be precise. The awk command above does the job well, splitting the file into one output file per record matching the regex; I've tested it on a smaller file with 25 such records of varied sizes. But with one file per record, I'm sure I'll quickly exceed the maximum number of files a directory can hold.
I tried the following pattern:
tmp_file_prefix="f-" ; awk -v i=0 '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} i % 20 == 0 {print > x;}' file.txt
and realized that it emits only the 20th record and saves just that one in a file. This solution is incorrect.
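The failure mode is easy to reproduce on a small synthetic input (test data made up for the demonstration): only the 20th record survives, because the condition is true only while the counter sits exactly at 20.

```shell
# 25 two-line records; the modulo condition matches only while i == 20.
for n in $(seq 1 25); do printf 'ID: %d\nrow %d\n' "$n" "$n"; done > file.txt
tmp_file_prefix="f-"
awk -v i=0 '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} i % 20 == 0 {print > x;}' file.txt
ls f-*    # only f-20 exists, holding the single 20th record
```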
I want a way, in the above awk command, to split the source file into smaller files, each containing 25000 (or n, for that matter) occurrences of the regex.
Upvotes: 1
Views: 2106
Reputation: 246764
awk -v prefix="$tmp_file_prefix" -v max=25000 '
function filename() { return sprintf("%s%06d", prefix, ++i) }
!x { x = filename() }
/^ID:/ {
    if (n == max) {
        close(x)
        x = filename()
        n = 0
    }
    n++
}
{ print > x }
' file.txt
This should not run out of open file handles, as it takes care to close the file when done.
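You can sanity-check the rollover on a toy input by lowering max; the sample data and max=2 below are just for illustration:

```shell
# Five two-line records; with max=2 they should land in three files.
printf 'ID: %d\ndata %d\n' 1 1 2 2 3 3 4 4 5 5 > sample.txt

awk -v prefix="f-" -v max=2 '
function filename() { return sprintf("%s%06d", prefix, ++i) }
!x { x = filename() }
/^ID:/ {
    if (n == max) { close(x); x = filename(); n = 0 }
    n++
}
{ print > x }
' sample.txt

grep -c '^ID:' f-000001    # 2 -- the first chunk holds two full records
```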
Upvotes: 2
Reputation: 2131
You can split the source file into smaller pieces first using split(1), then run your awk script on each piece. Obviously you will need to append to the output files, not overwrite them!
split -l 25000 -a 3 file.txt
will generate files xaaa, xaab, xaac, etc., each no more than 25000 lines long, which you can then process with your awk script.
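To see split's naming and sizing in action (the 70000-line input here is synthetic): note that split cuts on raw line counts, not record boundaries, so a record can straddle two pieces; that is why the per-piece outputs must be appended to rather than overwritten.

```shell
# 70000 lines split into 25000-line pieces -> xaaa, xaab, xaac.
seq 1 70000 > file.txt
split -l 25000 -a 3 file.txt
wc -l < xaaa    # 25000
wc -l < xaac    # 20000 (the remainder)
```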
Upvotes: 0