Reputation: 1682
Have a (G|M)awk script that is used to redistribute lines from source files into different files based on content. An example is sharding a set of source data into separate dated files based on a row timestamp.
It's fast enough until I need to add a CSV header row. The source data comes in unordered, so I have considered using
if ((getline < fname) < 0) { print cols > fname }
to create each destination file with the header line on first touch.
I know that running this test for every row of the source data can be expensive, but it appears that each getline test takes longer as destination files are created, and the rate at which source files are processed slows. This slowdown is the performance issue.
This process is also run under GNU Parallel, so a system() call running test -f does not work; it hangs on the system call.
Suggestions on how to address performance while creating files with a header? Looking to keep this script in Awk as there is other logic already in place.
As an example of the task, I have logs from multiple hosts that need to be combined into files based on the date of the log entry:
date, time, measure
2017-01-01, 00:00, 10
2017-01-01, 01:00, 20
2017-01-03, 00:05, 30
2017-01-02, 02:10, 40
2017-01-03, 00:00, 50
The result of this script would be 3 files based on the date column:
Filename: 20170101.log
date, time, measure
2017-01-01, 00:00, 10
2017-01-01, 01:00, 20
Filename: 20170102.log
date, time, measure
2017-01-02, 02:10, 40
Filename: 20170103.log
date, time, measure
2017-01-03, 00:05, 30
2017-01-03, 00:00, 50
Running this log recombination is fast and simple until the need to include the column header comes into play. It seems that as the destination files grow, each getline call takes longer. Other examples have shown the use of system() calls running test -f to check for a file's existence, but that too is an expensive operation, and it seems to hang under GNU Parallel.
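For reference, a trimmed sketch of the current approach (the field splitting, filename derivation, and the cols variable are simplified here; the real script carries more logic):
BEGIN { FS = ", *" }
FNR == 1 { cols = $0; next }         # capture each source file's header row
{
    fname = $1
    gsub(/-/, "", fname)             # 2017-01-01 -> 20170101
    fname = fname ".log"
    if ((getline junk < fname) < 0)  # probe: getline returns -1 if fname can't be opened
        print cols >> fname          # first touch, so the header goes in first
    else
        close(fname)                 # release the probe's read end
    print >> fname                   # append the data row (>> so reopens never truncate)
}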
Upvotes: 2
Views: 330
Reputation: 1682
It turns out my issue was with record separators.
The source files use RS="\r\n" (set in the BEGIN block), but the destination files use "\n" as a separator, since they are simply printed and this is on Linux. This caused getline to never find a complete record, so as a destination file grew, so did the time for each getline call: it was reading the whole file as a single record.
My rather inelegant solution is:
RS="\n"
if (getline < (fname) < 0) { print cols > fname }
RS="\r\n"
Upvotes: 0
Reputation: 46846
I think the approach I'd take would be to maintain an array of output files/pipes as input appears for them, and add the header on the basis of array membership rather than output-file existence. Testing array membership should be super fast, at least compared to spawning a test in a shell.
Something like this:
BEGIN {
    getline header    # this assumes we'll see a header as our
}                     # first line of input. Use whatever works.
{
    outfile = $1 ".log"          # or whatever..
    if (!(outfile in a)) {
        print header > outfile   # create the file with the header,
        a[outfile]               # and record the output file.
    }
    print > outfile              # shard
}
This eliminates the need to touch your filesystem to test for existence, but may be problematic if you have existing files to which you want to append. For that, you might want to prepopulate the array in your BEGIN block:
BEGIN {
    getline header
    cmd = "ls -d *.log 2>/dev/null"
    while ((cmd | getline outfile) > 0) a[outfile]   # > 0 guards against an infinite loop if the command fails
    close(cmd)
}
{
    outfile = $1 ".log"
    if (!(outfile in a)) {
        print header > outfile
        a[outfile]
    }
    print >> outfile
}
Note: This variant parses ls (ya ya, I know) to get a list of files to prepopulate the array, and then appends data (>>) instead of overwriting (>). I haven't tested this on logfiles that might contain special characters. On the other hand, filenames are $1 ".log", so the specialness is already kind of limited.
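If parsing ls is unappealing, an alternative is to keep the membership array but probe each new filename once with getline on first sight; the filesystem is then touched once per distinct output file rather than once per record. A sketch, assuming the same BEGIN { getline header } as above:
{
    outfile = $1 ".log"
    if (!(outfile in a)) {            # first sight of this file this run
        if ((getline junk < outfile) < 0)
            print header >> outfile   # couldn't be opened: new file, add the header
        else
            close(outfile)            # it exists: just release the probe
        a[outfile]                    # either way, never probe this name again
    }
    print >> outfile                  # append the record
}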
Upvotes: 1