AWK performance while processing big files

Question

I have an awk script that I use to calculate how long some transactions take to complete. The script takes the unique ID of each transaction and stores its minimum and maximum timestamps. It then calculates the difference and, at the end, shows the results that are over 60 seconds.

It works very well on a few hundred thousand lines (200k), but it takes much longer on real-world data. I tested it several times, and it takes about 15 minutes to process about 28 million lines. Can I consider this good performance, or is it possible to improve it?

I'm open to any kind of suggestion.

Here is the complete code:

zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log |  awk '{
gsub("[()]|:.*","",$4); #just removing ugly chars
++cont
min=$4"min" #key for the minimum value of the current transaction
max=$4"max" #key for the maximum value, named the same way for readability
split($2,secs,/[:,]/) #split hours,minutes and seconds
seconds = 3600*secs[1] + 60*secs[2] + secs[3] #turn everything into seconds
if(arr[min] > seconds || arr[min] == 0)
  arr[min]=seconds
if(arr[max] < seconds)
   arr[max]=seconds
dif=arr[max] - arr[min]
if(dif > 60)
  result[$4] = dif
}
END{
for(x in result)
   print x" - "result[x]
print ":Processed "cont" lines"
}'

Ed Morton · Accepted Answer

You don't need to calculate the dif every time you read a record. Just do it once in the END section.

You don't need that cont variable, just use NR.

You don't need to build separate "min" and "max" keys in a single array; use two arrays instead, since string concatenation is slow in awk.

You shouldn't change $4, as that forces awk to rebuild the record ($0).
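To illustrate the last point, here is a minimal sketch of what happens when a field is assigned. The function names `rebuild_demo` and `copy_demo` and the sample line are made up for the illustration. Assigning to any field makes awk rebuild $0 from its fields joined by OFS, while editing a copy of the field leaves the record alone:

```shell
# Assigning to a field (even to itself) makes awk rebuild $0,
# so the original spacing is replaced by single OFS separators.
rebuild_demo() {
    printf 'a   b    c\n' | awk '{ $2 = $2; print }'
}

# Working on a copy leaves $0 untouched: the first print shows the
# record with its original spacing, the second shows the edited copy.
copy_demo() {
    printf 'a   b    c\n' | awk '{ f = $2; sub(/b/, "X", f); print; print f }'
}

rebuild_demo
copy_demo
```

How much time this saves on 28 million lines depends on the awk implementation, so it is worth timing both variants on your own data.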

Try this:

awk '{
    name = $4
    gsub(/[()]|:.*/,"",name); #just removing ugly chars

    split($2,secs,/[:,]/) #split hours,minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3] #turn everything into seconds

    if (!(name in min)) {    # first record seen for this transaction ID
        min[name] = max[name] = seconds
    }
    else {
        if (min[name] > seconds) {
            min[name] = seconds
        }
        if (max[name] < seconds) {
            max[name] = seconds
        }
    }
}

END {
    for (name in min) {
        diff = max[name] - min[name]
        if (diff > 60) {
            print name, "-", diff
        }
    }
    print ":Processed", NR, "lines"
}'
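A quick way to sanity-check the min/max logic is to run it on a few lines where the answer is known. The sketch below assumes a log layout (time in field 2 as HH:MM:SS,mmm and the ID in field 4 as "(id:...") inferred from the script; the sample lines and the function names `sample_log` and `slow_transactions` are hypothetical. It seeds min and max the first time each ID appears, tested with `name in min`, so that IDs first seen on later lines do not start from an empty minimum:

```shell
# Hypothetical sample data: two transactions, one lasting 90 s, one 15 s.
sample_log() {
    printf '%s\n' \
        '2014-01-01 10:00:00,000 INFO (aaaaaaaaaaaaaaa1:start) begin' \
        '2014-01-01 10:00:05,000 INFO (bbbbbbbbbbbbbbb2:start) begin' \
        '2014-01-01 10:00:20,000 INFO (bbbbbbbbbbbbbbb2:end) done' \
        '2014-01-01 10:01:30,000 INFO (aaaaaaaaaaaaaaa1:end) done'
}

slow_transactions() {
    awk '{
        name = $4
        gsub(/[()]|:.*/, "", name)          # keep only the transaction ID

        split($2, secs, /[:,]/)             # hours, minutes, seconds, millis
        seconds = 3600*secs[1] + 60*secs[2] + secs[3]

        if (!(name in min)) {               # first time this ID is seen
            min[name] = max[name] = seconds
        }
        else {
            if (min[name] > seconds) min[name] = seconds
            if (max[name] < seconds) max[name] = seconds
        }
    }
    END {
        for (name in min) {
            diff = max[name] - min[name]
            if (diff > 60) print name, "-", diff
        }
    }'
}

sample_log | slow_transactions    # prints: aaaaaaaaaaaaaaa1 - 90
```

Only the first transaction is reported, because the second one spans 15 seconds. To compare run times on the real log, prefixing each variant of the pipeline with `time` should be enough.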
