pythonRcpp

Reputation: 2146

Counting lines in a file during particular timestamps in bash

I am scheduling a cron job that runs every minute and reports the count of REJECT entries for each minute. The file is logged to continuously, so to avoid redundant reads I store the line number where the previous run stopped and resume from it with tail -n +lastTimeWC. But how do I count the number of REJECTs per minute? Sample input:

20170327-09:15:01.283619074 ResponseType:REJECT
20170327-09:15:01.287619074 ResponseType:REJECT
20170327-09:15:01.289619074 ResponseType:REJECT
20170327-09:15:01.290619074 ResponseType:REJECT
20170327-09:15:01.291619074 ResponseType:REJECT
20170327-09:15:01.295619074 ResponseType:REJECT
20170327-09:15:01.297619074 ResponseType:REJECT
20170327-09:16:02.283619074 ResponseType:REJECT
20170327-09:16:03.283619074 ResponseType:REJECT
20170327-09:17:02.283619074 ResponseType:REJECT
20170327-09:17:07.283619074 ResponseType:REJECT

Expected Output:

9:15 REJECT 7
9:16 REJECT 2
9:17 REJECT 2

Update1: (Using Ed Morton's answer)

#!/usr/bin/bash
while :
do
awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print NR, prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' $1
sleep 60
done

This script gives me output every 60 seconds, but it should only report timestamps newly added to the logfile ($1). Suppose 9:18 gets added: it should then include just that minute in the output, not repeat everything from 9:15 to 9:18 again.
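One way to get only newly completed minutes on each pass is to persist the last minute already reported in a small state file and skip anything at or before it. This is a minimal sketch; sample.log, .last_minute, and the embedded sample data are placeholders, not names from the question:

```shell
# Sketch: remember the last minute already reported and skip it next run.
logfile=sample.log
state=.last_minute
rm -f "$state"                        # simulate a first run

cat > "$logfile" <<'EOF'
20170327-09:15:01.283619074 ResponseType:REJECT
20170327-09:16:02.283619074 ResponseType:REJECT
20170327-09:16:03.283619074 ResponseType:REJECT
20170327-09:17:02.283619074 ResponseType:REJECT
EOF

last=$(cat "$state" 2>/dev/null)      # empty on the first run
out=$(awk -F '[:-]' -v last="$last" '
    {curr=$2":"$3}
    (last!="") && (curr<=last) {next}              # already reported
    (prev!="") && (curr!=prev) {print prev, $NF, cnt; cnt=0}
    {cnt++; prev=curr}
' "$logfile")

if [ -n "$out" ]; then
    printf '%s\n' "$out"
    # save the last minute we actually printed
    printf '%s\n' "$out" | tail -n 1 | cut -d' ' -f1 > "$state"
fi
```

On the next run, any minute at or before the saved one is skipped, so only newly completed minutes appear; the still-open last minute (09:17 here) is deliberately never printed.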

Upvotes: 1

Views: 1823

Answers (3)

NeronLeVelu

Reputation: 10039

Including the REJECT filter and the date, in a streaming version (no array in memory, just the last counter and date reference):

awk -F '-|:..[.]|pe:' '$NF=="REJECT"{if(L==$1"-"$2)C++;else{if(L!="")print L" REJECT "C;C=1;L=$1"-"$2}}END{if(L!="")print L" REJECT "C}' YourLog

Including the "don't re-report the same info" behaviour asked for in the comments (note in the code the "last known time" that is re-read):

CFile=Counter.log
# just to ensure there is a counter file (it may be empty) for awk input
touch ${CFile}
awk -F '-|:..[.]|pe:' -v CF="${CFile}" '
   FNR==NR {
      if( CF == FILENAME) {L=$0;next}
      }

   # don't treat elements before L
   # (so we include the last known time that may still have been logging at the last cycle)
   L > ( $1 "-" $2 ) { next }

   $NF=="REJECT" {
      if(L==$1"-"$2)C++
       else {
         print L" REJECT " C;C=1;L=$1"-"$2
         }
      }
   END{
      print L" REJECT " C
      # write new counter info
      print L > CF
      }
   ' ${CFile} YourLog

Upvotes: 0

Ed Morton

Reputation: 204731

Don't print the last count since it may not be complete for that timestamp; just print the counts before that:

$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file
09:15 REJECT 7
09:16 REJECT 2

If you really WANTED to print the last one too then just add a print in an END section:

$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr} END{print prev, $NF, cnt}' file
09:15 REJECT 7
09:16 REJECT 2
09:17 REJECT 2

but I'd imagine you have to just discard that possibly partial result anyway so what's the point?

Note that you don't have to store all the results in an array and then print them in the END section, just print them every time the timestamp changes. In addition to using memory unnecessarily, the solutions that store all of the results in an array and then print them with a loop in the END section using in will print the output in random (actually hash) order, not the order the timestamps occur in your input (unless by dumb luck sometimes).
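A quick way to see the ordering problem (the four sample minutes here are invented): counting into an array and looping over it with in emits the keys in arbitrary order, so a trailing sort is needed to restore them, whereas the streaming version never needs that extra pass:

```shell
# Array-based counting emits keys in unspecified (hash) order;
# piping through sort restores chronological order after the fact.
out=$(printf '%s\n' 09:17 09:15 09:16 09:15 |
      awk '{cnt[$1]++} END{for (t in cnt) print t, cnt[t]}' |
      sort)
printf '%s\n' "$out"
```

Without the sort, the three output lines can come out in any order depending on the awk implementation's hashing.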

Rather than storing the line count of your input file (which can cause false results when a timestamp's results are split across invocations of the script, AND makes it impossible to use logrotate or similar to truncate your log file as it gets long/old), store the last timestamp analyzed and start after that on the current iteration, e.g. do the equivalent of this with cron:

while :
do
    results=( $(awk -F '[:-]' -v last="$lastTimeStamp" '{curr=$2":"$3} curr<last{next} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file) )
    numResults="${#results[@]}"
    if (( numResults > 0 ))
    then
        printf '%s\n' "${results[@]}"
        (( lastIndex = numResults - 1 ))
        lastResult="${results[$lastIndex]}"
        lastTimeStamp="${lastResult%% *}"
    fi
    sleep 60
done

or, if you wanted to use line numbers so you can do tail, then rather than using wc -l to get the length of the file (which would include the current timestamp whose potentially incomplete results you are not printing), have awk print the line number of the line after the last line associated with each timestamp:

$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print NR, prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file
8 09:15 REJECT 7
10 09:16 REJECT 2

and strip it off to save the last value before printing the result. That last value is what you'll do tail -n +<startLineNr> | awk '...' with next iteration.
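A sketch of one iteration of that wrapper (reject.log, the startLine variable, and the sample data are assumptions for illustration): awk reports the NR of the first line after each completed minute, and the wrapper strips it off for display and turns the last one into the next tail offset:

```shell
logfile=reject.log
cat > "$logfile" <<'EOF'
20170327-09:15:01.283619074 ResponseType:REJECT
20170327-09:15:01.287619074 ResponseType:REJECT
20170327-09:16:02.283619074 ResponseType:REJECT
20170327-09:17:02.283619074 ResponseType:REJECT
EOF

startLine=1
out=$(tail -n "+$startLine" "$logfile" |
      awk -F '[:-]' '{curr=$2":"$3}
          (prev!="") && (curr!=prev) {print NR, prev, $NF, cnt; cnt=0}
          {cnt++; prev=curr}')
if [ -n "$out" ]; then
    printf '%s\n' "$out" | cut -d' ' -f2-     # show results without NR
    lastNR=$(printf '%s\n' "$out" | tail -n 1 | cut -d' ' -f1)
    # convert the relative NR back to an absolute line number
    startLine=$(( startLine + lastNR - 1 ))   # tail -n "+$startLine" next time
fi
```

Here the next iteration would run tail -n +4, starting at the first line of the still-open 09:17 minute.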

btw, you didn't show us this in your sample input, but if your log file contains lines that do not contain REJECT and you want those ignored, just add $NF!="REJECT"{next} at the start of the awk script.
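For example, with a hypothetical mixed log (the ACCEPT line is invented here to show the effect), the extra rule drops non-REJECT records before they are counted:

```shell
out=$(awk -F '[:-]' '
    $NF!="REJECT" {next}                 # ignore non-REJECT records
    {curr=$2":"$3}
    (prev!="") && (curr!=prev) {print prev, $NF, cnt; cnt=0}
    {cnt++; prev=curr}' <<'EOF'
20170327-09:15:01.283619074 ResponseType:REJECT
20170327-09:15:01.290619074 ResponseType:ACCEPT
20170327-09:15:01.291619074 ResponseType:REJECT
20170327-09:16:02.283619074 ResponseType:REJECT
EOF
)
printf '%s\n' "$out"
```

The ACCEPT line is skipped, so 09:15 counts 2 rather than 3 (and the open 09:16 minute is held back as before).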

Upvotes: 2

Inian

Reputation: 85895

You can do this in Awk by hashing the minute value as the index, assuming the status does not change within a minute, something like below:

awk -F'[-:]' '{unique[$2":"$3]++; uniqueValue[$2":"$3]=$NF; next}END{for (i in unique) print i,uniqueValue[i],unique[i]}' file
09:15 REJECT 7
09:16 REJECT 2
09:17 REJECT 2

Upvotes: 1
