Devesh

Reputation: 2054

Performance issue with this AWK script - it takes around 200 minutes on a 5GB input file

The code below runs on a 5GB file and consumes 99% CPU. I want to know whether I am doing something terribly wrong, or whether anything can improve the execution time. Here is a sample input line:

2013-04-03 08:54:19,989 INFO [Logger] 2013-04-03T08:54:19.987-04:00PCMC.common.manage.springUtil<log-message-body><headers>&lt;fedDKPLoggingContext id="DKP_DumpDocumentProperties" type="context.generated.FedDKPLoggingContext"&gt;&lt;logFilter&gt;7&lt;/logFilter&gt;&lt;logSeverity&gt;255&lt;/logSeverity&gt;&lt;schemaType&gt;PCMC.MRP.DocumentMetaData&lt;/schemaType&gt;&lt;UID&gt;073104c-4e-4ce-bda-694344ee62&lt;/UID&gt;&lt;consumerSystemId&gt;JTR&lt;/consumerSystemId&gt;&lt;consumerLogin&gt;jbserviceid&lt;/consumerLogin&gt;&lt;logLocation&gt;Successful Completion of Service&lt;/logLocation&gt;&lt;/fedDKPLoggingContext&gt;</headers><payload>0</payload></log-message-body>

This is the code I am using. I also tried with the gz format, but all in vain. I am calling this awk from bash with the command below.

awk -f mytest.awk <(gzip -dc scanned-file.$yesterday.gz)| gzip > tem.gz

cat mytest.awk
#!/bin/awk -f

function to_ms (time, time_ms, s) {
    # convert "HH:MM:SS,mmm" to milliseconds since midnight
    split(time, s, /:|,/)
    time_ms = (s[1]*3600+s[2]*60+s[3])*1000+s[4]
    return time_ms
}

{
   stid = gensub(/.*UID&gt;([^&]+).*/,"\\1","")
}

(stid in starttime) {
    etime = to_ms($2)
    endtime[stid] = etime
    docid[stid] = gensub(/.*id="([^""]+).*/,"\\1","")
    consumer[stid]= gensub(/.*schemaType&gt;PNC.([^.]+).*/,"\\1","")
    state[stid]= gensub(/.*&lt;logLocation&gt;([^'' ]+).*/,"\\1","")
    next
}

{
    stime = to_ms($2)
    starttime[stid] = stime
    st_hour[stid] = stime/(60*60*1000)
    timestamp[stid] = $1" "$2
}

END {
    print "Document,Consumer,Hour,ResponseTime,Timestamp,State"
    for (x in starttime) {
        for (y in endtime) {
            if (x==y) {
                diff = (endtime[y]-starttime[x])
                st = sprintf("%02d", st_hour[x])
                print docid[y], consumer[y], st":00", diff, timestamp[x], state[y] |"sort -k3"
                delete starttime[x]
                delete endtime[y]
                delete docid[y]
                delete consumer[y]
                delete timestamp[x]
                delete state[y]
            } 
        }
    }
}

Upvotes: 0

Views: 217

Answers (3)

Devesh

Reputation: 2054

@Ed, the first approach is not giving me the expected result. This is what I did:

# end time and diff
(stid in starttime) {
    etime = to_ms($2)
    diff = etime - stime
    print diff, stid
    delete starttime[stid]
    next
}

# populate starttime
{
    stime = to_ms($2)
    starttime[stid] = stime
    st_hour[stid] = stime/(60*60*1000)
}

The output looks like the following; the value on the left should be the response time in milliseconds, followed by the stid.

561849 c858591f-e01b-4407-b9f9-48302b65c383
562740 c858591f-e01b-4407-b9f9-48302b65c383
563629 56c71ef3-d952-4261-9711-16b18a32c6ba
564484 56c71ef3-d952-4261-9711-16b18a32c6ba

Upvotes: 0

TrueY

Reputation: 7610

In the END section the code always runs through the entire inner loop, even after the y item has been found and deleted from the endtime array. I would suggest using a break to jump out of the inner loop, as in the sketch below.
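A minimal sketch of that change, keeping everything else in the question's END block the same:

for (x in starttime) {
    for (y in endtime) {
        if (x == y) {
            # ... the existing diff/print/delete statements ...
            break  # stop scanning endtime once the match is found
        }
    }
}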

On the other hand, as far as I can see, the inner loop is not needed at all: it merely searches for an element with a known key in an associative array.

I would also suggest not deleting the found items. Looking up an item in an associative array takes roughly constant time (depending on the hashing algorithm and the number of collisions), so removing items from the array will not necessarily speed up the process, while the deletions themselves will definitely slow it down.

So I would suggest using this:

for (x in starttime) {
    if (x in endtime) {
        diff = (endtime[x]-starttime[x])
        st = sprintf("%02d", st_hour[x])
        print docid[x], consumer[x], st":00", diff, timestamp[x], state[x] |"sort -k3"
    }
}

Using gzipped input will consume even more CPU resources, but you can spare some I/O bandwidth.

Upvotes: 3

Ed Morton

Reputation: 203229

Assuming there's just one end time for every stid - don't build up an array of start times and an array of end times and then loop through them both; just process the stid when you hit its end time. I.e. not this, as you do today:

{ stid = whatever }
stid in starttime {
   populate endtime[stid]
   next
}
{ populate starttime[stid] }
END {
   for (x in starttime) {
      for (y in endtime) {
          if (x == y) {
              stid = x
              process endtime[stid] - starttime[stid]
          }
      }
   }
}

but this:

{ stid = whatever }
stid in starttime {
   process to_ms($2) - starttime[stid]
   delete starttime[stid]
   next
}
{ populate starttime[stid] }
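For concreteness, here is a minimal runnable version of that single-pass structure, filled in with the to_ms() function and the UID extraction from the question. It prints only the stid and the response time; the other columns (docid, consumer, state, ...) would be extracted the same way as in the original script. A sketch, not a drop-in replacement:

#!/bin/awk -f

function to_ms(time, time_ms, s) {
    # convert "HH:MM:SS,mmm" to milliseconds since midnight
    split(time, s, /:|,/)
    return (s[1]*3600 + s[2]*60 + s[3])*1000 + s[4]
}

{ stid = gensub(/.*UID&gt;([^&]+).*/, "\\1", 1) }

stid in starttime {
    # end record: compute and print the response time immediately
    print stid, to_ms($2) - starttime[stid]
    delete starttime[stid]    # keep the array from growing on a 5GB file
    next
}

{ starttime[stid] = to_ms($2) }    # start record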

If you can't do that, e.g. because there are multiple records with the same stid and you want the timestamps from the first and last ones, then change the loop in your END section to loop only through the stids you got end times for (since you already KNOW they have corresponding start times), instead of trying to match up all the stids in those massive loops over starttime and endtime, e.g.:

{ stid = whatever }
stid in starttime {
   populate endtime[stid]
   next
}
{ populate starttime[stid] }
END {
   for (stid in endtime) {
      process endtime[stid] - starttime[stid]
   }
}
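Filling that END loop in with the arrays from the question gives something like this (a sketch; it assumes, per the logic above, that every key of endtime also exists in starttime):

END {
    print "Document,Consumer,Hour,ResponseTime,Timestamp,State"
    for (stid in endtime) {
        diff = endtime[stid] - starttime[stid]
        st = sprintf("%02d", st_hour[stid])
        print docid[stid], consumer[stid], st":00", diff, timestamp[stid], state[stid] | "sort -k3"
    }
}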

With either approach you should see a massive performance improvement.

Upvotes: 3
