Reputation: 33
I have a system that is generating very large text logs (in excess of 1GB each). The utility into which I am feeding them requires that each file be less than 500MB. I cannot simply use the split command because this runs the risk of splitting a log entry in half, which would cause errors in the utility to which they are being fed.
I have done some research into split, csplit, and awk. So far I have had the most luck with the following:
awk '/REG_EX/{if(NR%X >= (X-Y) || NR%X <= Y)x="split"++i;}{print > x;}' logFile.txt
In the above example, X represents the number of lines I want each split file to contain. In practice, this ends up being about 10 million. Y represents a "plus or minus." So if I want "10 million plus or minus 50", Y allows for that.
The actual regular expression I use is not important, because that part works. The goal is for the file to be split every X lines, but only on a line that matches REG_EX. This is where the if() clause comes in. I attempted to allow some "wiggle room" of plus or minus Y lines, because there is no guarantee that REG_EX will occur at exactly NR%X. My problem is that if I set Y too small, I end up with files containing two or three times the number of lines I am aiming for. If I set Y too large, I end up with some files containing anywhere between 1 and X lines (it is possible for REG_EX to occur several times in immediate succession).
Short of writing my own program that traverses the file line by line with a line counter, how can I go about elegantly solving this problem? I have a script that a co-worker created, but it easily takes over an hour to complete. My awk command completes in less than 60 seconds on a 1.5GB file with an X value of 10 million, but it is not a 100% solution.
== EDIT ==
Solution found. Thank you to everyone who took the time to read my question, understand it, and provide a suggested solution. Most of them were very helpful, but the one I marked as the solution provided the greatest assistance. My problem was that I was using modular math as the cutoff point. I needed a way to keep track of lines and reset the counter each time I split a file. Being new to awk, I wasn't sure how to use the BEGIN{ ... } feature. Allow me to summarize the problem set and then list the command that solved the problem.
PROBLEM:
-- System produces text logs > 1.5GB
-- System into which logs are fed requires logs <= 500MB.
-- Every log entry begins with a standardized line
-- using the split command risks a new file beginning WITHOUT the standard line
REQUIREMENTS:
-- split files at Xth line, BUT
-- IFF Xth line is in the standard log entry format
NOTE:
-- log entries vary in length, with some being entirely empty
SOLUTION:
awk 'BEGIN {min_line=10000000; curr_line=1; new_file="split1"; suff=1;} \
/REG_EX/ \
{if(curr_line >= min_line){new_file="split"++suff; curr_line=1;}} \
{++curr_line; print > new_file;}' logFile.txt
The command can be typed on one line; I broke it up here for readability. Ten million lines works out to between 450MB and 500MB. I realized that, given how frequently the standard log entry line occurs, I didn't need to set an upper line limit so long as I picked a lower limit with room to spare. Each time REG_EX is matched, the command checks whether the current number of lines is greater than my limit, and if it is, starts a new file and resets the counter.
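To see the counter reset in action, here is the same logic scaled down to a toy file; min_line=3 and the pattern ENTRY stand in for the real 10-million-line limit and standardized header:

```shell
#!/bin/sh
# Toy-scale run of the accepted solution: min_line=3, header pattern "ENTRY"
printf '%s\n' ENTRY a b ENTRY c ENTRY d e f g > toy.log
awk 'BEGIN {min_line=3; curr_line=1; new_file="split1"; suff=1;}
/ENTRY/ {if (curr_line >= min_line) {new_file="split"++suff; curr_line=1;}}
{++curr_line; print > new_file;}' toy.log
```

Every output file begins with an ENTRY line: split1 gets ENTRY/a/b, split2 gets ENTRY/c (the counter had already passed min_line), and split3 gets the remaining lines.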
Thanks again to everyone. I hope that anyone else who runs into this or a similar problem finds this useful.
Upvotes: 3
Views: 670
Reputation: 1562
If splitting based on the regex is not important, one option would be to create new files line by line, keeping track of the number of characters you are adding to an output file. If the number of characters is greater than a certain threshold, you can start outputting to the next file. An example command-line script is:
awk 'BEGIN{sum=0; suff=1; new_file="tmp1"} {len=length($0); if ((sum + len) > 500000000) { ++suff; new_file = "tmp"suff; sum = 0} sum += len; print $0 > new_file}' logfile.txt
In this script, sum keeps track of the number of characters parsed from the given log file. While sum is within 500 MB, it keeps outputting to tmp1. Once sum is about to exceed that limit, it will start outputting to tmp2, and so on.
This script will not create files that are greater than the size limit, and it will never split in the middle of a line. Note, however, that it doesn't make use of the pattern matching from your script, so a multi-line log entry could still end up split across two files.
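A scaled-down version of the same idea makes the rollover easy to observe; the 20-character threshold and the demo.log/tmpN file names here are stand-ins for the real 500 MB limit:

```shell
#!/bin/sh
# Byte-threshold splitting, scaled down: roll to a new file past 20 characters
printf '%s\n' aaaaaaaaaa bbbbbbbbbb cccccccccc dddddddddd > demo.log
awk 'BEGIN {sum=0; suff=1; new_file="tmp1"}
{len=length($0);
 if ((sum + len) > 20) {++suff; new_file = "tmp" suff; sum = 0}
 sum += len; print $0 > new_file}' demo.log
```

One caveat: length($0) does not count the trailing newline, so the actual files run slightly larger than sum; with a real 500 MB budget you would want to add 1 per line or leave some headroom.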
Upvotes: 0
Reputation: 12672
Replace the fout and slimit values to suit your needs:
#!/bin/bash
# big log filename
f="test.txt"
fout="$(mktemp -p . f_XXXXX)"
fsize=0
slimit=2500
while IFS= read -r line; do
  if [ "$fsize" -gt "$slimit" ]; then
    echo "size of last file $fout: $fsize"
    # create a new log file and reset the size counter
    fout="$(mktemp -p . f_XXXXX)"
    fsize=0
  fi
  # append to log file and get line size at the same time ;-)
  lsize=$(printf '%s\n' "$line" | tee -a "$fout" | wc -c)
  # add to file size
  fsize=$(( fsize + lsize ))
done < <(grep 'YOUR_REGEXP' "$f")
Sample output:
size of last file ./f_GrWgD: 2537
size of last file ./f_E0n7E: 2547
size of last file ./f_do2AM: 2586
size of last file ./f_lwwhI: 2548
size of last file ./f_4D09V: 2575
size of last file ./f_ZuNBE: 2546
Upvotes: 0
Reputation: 2502
If you want to create split files based on exact n-count of pattern occurrences, you could do this:
awk '/^MYREGEX/ {++i; if(i%3==1){++j}} {print > "splitfilename"j}' logfile.log
Where:
-- ^MYREGEX is your desired pattern.
-- 3 is the count of pattern occurrences you want in each file.
-- splitfilename is the prefix of the filenames to be created.
-- logfile.log is your input log file.
-- i is a counter which is incremented for each occurrence of the pattern.
-- j is a counter which is incremented for each n-th occurrence of the pattern.
Example:
$ cat test.log
MY
123
ksdjfkdjk
MY
234
23
MY
345
MY
MY
456
MY
MY
xyz
xyz
MY
something
$ awk '/^MY/ {++i; if(i%3==1){++j}} {print > "file"j}' test.log
$ ls
file1 file2 file3 test.log
$ head file*
==> file1 <==
MY
123
ksdjfkdjk
MY
234
23
MY
345
==> file2 <==
MY
MY
456
MY
==> file3 <==
MY
xyz
xyz
MY
something
Upvotes: 1
Reputation: 2502
You could potentially split the log file into chunks of 10 million lines. Then, if the 2nd split file does not start with the desired line, find the last desired line in the 1st split file, delete that line and all subsequent lines from it, and prepend those lines to the 2nd file. Repeat for each subsequent split file. This would produce files with a very similar count of your regex matches.
In order to improve performance and not have to actually write out intermediary split files and edit them, you could use a tool such as pt-fifo-split for "virtually" splitting your original log file.
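The split-and-repair procedure from the first paragraph can be sketched at toy scale like this, assuming GNU split and using ENTRY as a stand-in for the real header pattern (in practice -l would be 10000000):

```shell
#!/bin/sh
# Split blindly, then repair chunk boundaries so each chunk starts on a header
printf '%s\n' ENTRY a b ENTRY c d e f ENTRY g > logFile.txt
split -l 4 -d logFile.txt part_    # part_00, part_01, part_02

prev=""
for f in part_*; do
  # if this chunk does not begin with a header, move the tail of the
  # previous chunk (from its last header onward) to the front of this one
  if [ -n "$prev" ] && ! head -n 1 "$f" | grep -q '^ENTRY'; then
    n=$(grep -n '^ENTRY' "$prev" | tail -n 1 | cut -d: -f1)
    if [ -n "$n" ]; then
      tail -n +"$n" "$prev" > tmp_move
      head -n $((n - 1)) "$prev" > tmp_keep
      mv tmp_keep "$prev"
      cat "$f" >> tmp_move
      mv tmp_move "$f"
    fi
  fi
  prev="$f"
done
```

After the repair pass, part_01 (which split started mid-entry, on the line "c") begins with the ENTRY header moved over from part_00.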
Upvotes: 0