Reputation: 1682
Have a (G|M)awk script that is used to redistribute lines from source files into different files based on content. An example is sharding a set of source data into separate dated files based on a row timestamp.
It's fast enough until I need to add a CSV header row. The source data comes in unordered, so I have considered using
if ((getline < fname) < 0) { print cols > fname }
to create each destination file with the header line on first touch.
I know that running this test for every row of the source data can be expensive, but it appears that each getline test takes longer as destination files are created, and the rate at which source files are processed slows. This slowdown is the performance issue.
This process is also run under GNU Parallel, so a system() call running test -f does not work; it hangs on the system call.
Suggestions on how to address performance while creating files with a header? Looking to keep this script in Awk as there is other logic already in place.
As an example of the task, I have logs from multiple hosts that need to be combined into files based on the date of the log entry:
date, time, measure
2017-01-01, 00:00, 10
2017-01-01, 01:00, 20
2017-01-03, 00:05, 30
2017-01-02, 02:10, 40
2017-01-03, 00:00, 50
The result of this script would be 3 files based on the date column:
Filename: 20170101.log
date, time, measure
2017-01-01, 00:00, 10
2017-01-01, 01:00, 20
Filename: 20170102.log
date, time, measure
2017-01-02, 02:10, 40
Filename: 20170103.log
date, time, measure
2017-01-03, 00:05, 30
2017-01-03, 00:00, 50
Running this log recombination is fast and simple until the need to include the column header comes into play. It seems that as the destination files grow, each getline call takes longer. Other examples have shown the use of system() calls running test -f to check for a file's existence, but that too is an expensive operation, and it seems to hang under GNU Parallel.
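For reference, a trimmed sketch of the current approach (the field splitting, filename derivation, and the cols variable are simplified here; the real script carries more logic):
BEGIN { FS = ", *" }
FNR == 1 { cols = $0; next }         # capture each source file's header row
{
    fname = $1
    gsub(/-/, "", fname)             # 2017-01-01 -> 20170101
    fname = fname ".log"
    if ((getline junk < fname) < 0)  # probe: getline returns -1 if fname can't be opened
        print cols >> fname          # first touch, so the header goes in first
    else
        close(fname)                 # release the probe's read end
    print >> fname                   # append the data row (>> so reopens never truncate)
}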
Upvotes: 2
Views: 330
Reputation: 1682
It turns out my issue was with record separators.
The source files use RS="\r\n" (set in the BEGIN block), but the destination files use "\n" as a separator, since they are simply printed and this is on Linux. This caused getline to never find a complete record, so as a destination file grew, so did the time for each getline call: it was reading the whole file as a single record.
My rather inelegant solution is:
RS="\n"
if (getline < (fname) < 0) { print cols > fname }
RS="\r\n"
Upvotes: 0
Reputation: 46846
I think the approach I'd take would be to maintain an array of output files/pipes as input appears for them, and add the header on the basis of array membership rather than output-file existence. Testing array membership should be super fast, at least compared to spawning a test in a shell.
Something like this:
BEGIN {
    getline header    # this assumes we'll see a header as our
}                     # first line of input. Use whatever works.
{
    outfile = $1 ".log"          # or whatever..
    if (!(outfile in a)) {
        print header > outfile   # create the file with the header,
        a[outfile]               # and record the output file.
    }
    print > outfile              # shard
}
This eliminates the need to touch your filesystem to test for existence, but may be problematic if you have existing files to which you want to append. For that, you might want to prepopulate the array in your BEGIN block:
BEGIN {
    getline header
    cmd = "ls -d *.log 2>/dev/null"
    while ((cmd | getline outfile) > 0) a[outfile]   # > 0 guards against an infinite loop if the command fails
    close(cmd)
}
{
    outfile = $1 ".log"
    if (!(outfile in a)) {
        print header > outfile
        a[outfile]
    }
    print >> outfile
}
Note: This variant parses ls (ya ya, I know) to get a list of files to prepopulate the array, and then appends data (>>) instead of overwriting (>). I haven't tested this on logfiles that might contain special characters. On the other hand, filenames are $1 ".log", so the specialness is already kind of limited.
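If parsing ls is unappealing, an alternative is to keep the membership array but probe each new filename once with getline on first sight; the filesystem is then touched once per distinct output file rather than once per record. A sketch, assuming the same BEGIN { getline header } as above:
{
    outfile = $1 ".log"
    if (!(outfile in a)) {            # first sight of this file this run
        if ((getline junk < outfile) < 0)
            print header >> outfile   # couldn't be opened: new file, add the header
        else
            close(outfile)            # it exists: just release the probe
        a[outfile]                    # either way, never probe this name again
    }
    print >> outfile                  # append the record
}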
Upvotes: 1