Reputation: 2146
I am scheduling a cron job that runs every minute and reports the count of REJECT entries per minute. My file is logged to continuously, and to avoid redundant reads I store the number of lines read on the previous run and skip them with tail -n +lastTimeWC. But how do I count the number of REJECTs per minute? Sample input:
20170327-09:15:01.283619074 ResponseType:REJECT
20170327-09:15:01.287619074 ResponseType:REJECT
20170327-09:15:01.289619074 ResponseType:REJECT
20170327-09:15:01.290619074 ResponseType:REJECT
20170327-09:15:01.291619074 ResponseType:REJECT
20170327-09:15:01.295619074 ResponseType:REJECT
20170327-09:15:01.297619074 ResponseType:REJECT
20170327-09:16:02.283619074 ResponseType:REJECT
20170327-09:16:03.283619074 ResponseType:REJECT
20170327-09:17:02.283619074 ResponseType:REJECT
20170327-09:17:07.283619074 ResponseType:REJECT
Expected Output:
9:15 REJECT 7
9:16 REJECT 2
9:17 REJECT 2
Update1: (Using Ed Morton's answer)
#!/usr/bin/bash
while :
do
awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print NR, prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' $1
sleep 60
done
This script gives me output every 60 seconds, but it reprints everything. It should only report new timestamps added to the logfile ($1).
Suppose 9:18 gets added; then it should just start including that in the answer (not 9:15 to 9:18 all over again).
Upvotes: 1
Views: 1823
Reputation: 10039
Including the REJECT filter and the date, in a streaming version (no array in memory, just the last counter and date reference):
awk -F '-|:..[.]|pe:' '$NF=="REJECT"{if(L==$1"-"$2)C++;else{if(L!="")print L" REJECT " C;C=1;L=$1"-"$2}}END{if(L!="")print L" REJECT " C}' YourLog
Including the "don't re-treat the same info" behaviour asked for in the comment (note in the code that the "last known time" is deliberately re-read):
CFile=Counter.log
# just to ensure there is a counter file (could be empty) for awk input
touch ${CFile}
awk -F '-|:..[.]|pe:' -v CF="${CFile}" '
FNR==NR {
if( CF == FILENAME) {L=$0;next}
}
# don't treat elements before the saved timestamp
# (so we include the last known time, which may have still been logging at the last cycle)
L > ( $1 "-" $2 ) { next }
$NF=="REJECT" {
if(L==$1"-"$2)C++
else {
if (L != "") print L" REJECT " C; C=1; L=$1"-"$2
}
}
END{
print L" REJECT " C
# write new counter info
print L > CF
}
' ${CFile} YourLog
Upvotes: 0
Reputation: 204731
Don't print the last count since it may not be complete for that timestamp, just print the counts before that:
$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file
09:15 REJECT 7
09:16 REJECT 2
If you really WANTED to print the last one too then just add a print in an END section:
$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr} END{print prev, $NF, cnt}' file
09:15 REJECT 7
09:16 REJECT 2
09:17 REJECT 2
but I'd imagine you have to just discard that possibly partial result anyway so what's the point?
Note that you don't have to store all the results in an array and then print them in the END section; just print them every time the timestamp changes. In addition to using memory unnecessarily, the solutions that store all of the results in an array and then print them with a for (i in array) loop in the END section will print the output in random (actually hash) order, not the order the timestamps occur in your input (unless by dumb luck sometimes).
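A minimal sketch of that point (sample lines deliberately out of order, counting assumed to use the same field split as above): the array-in-END version needs an explicit sort to guarantee timestamp order.

```shell
# Array-based counting prints in hash order, so pipe through sort
# to get the minutes back in chronological order.
printf '%s\n' \
  '20170327-09:16:02.283619074 ResponseType:REJECT' \
  '20170327-09:15:01.283619074 ResponseType:REJECT' \
  '20170327-09:15:01.287619074 ResponseType:REJECT' |
awk -F'[-:]' '{cnt[$2":"$3]++} END{for (t in cnt) print t, "REJECT", cnt[t]}' |
sort
# 09:15 REJECT 2
# 09:16 REJECT 1
```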
Rather than storing the line count of your input file (which can cause false results when a timestamp's results are split across invocations of the script, AND makes it impossible to use logrotate or similar to truncate your log file as it gets long/old), store the last timestamp analyzed and start after that on the current iteration, e.g. do the equivalent of this with cron:
while :
do
mapfile -t results < <(awk -F '[:-]' -v last="$lastTimeStamp" '{curr=$2":"$3} curr<last{next} (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file)
numResults="${#results[@]}"
if (( numResults > 0 ))
then
printf '%s\n' "${results[@]}"
(( lastIndex = numResults - 1 ))
lastResult="${results[$lastIndex]}"
lastTimeStamp="${lastResult%% *}"
fi
sleep 60
done
Or if you wanted to use line numbers so you can do a tail, then rather than using wc -l to get the length of the file (which would include the current timestamp you are not printing potentially incomplete results for), have awk print the line number of the line after the last line associated with each timestamp:
$ awk -F '[:-]' '{curr=$2":"$3} (prev!="") && (curr!=prev){print NR, prev, $NF, cnt; cnt=0} {cnt++; prev=curr}' file
8 09:15 REJECT 7
10 09:16 REJECT 2
and strip it off to save the last value before printing the result. That last value is what you'll use for tail -n +<startLineNr> | awk '...' on the next iteration.
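A sketch of that tail-based iteration, assuming the log is named file (the function and variable names here are illustrative, not from the answer). One pass reads from startLineNr onward, reports the completed minutes, and advances startLineNr to the first line of the still-open minute so it is re-read next time:

```shell
#!/usr/bin/env bash
startLineNr=1

count_new() {
    local results lastNR
    results=$(tail -n "+$startLineNr" file |
        awk -F'[:-]' '{curr=$2":"$3}
            (prev!="") && (curr!=prev){print NR, prev, $NF, cnt; cnt=0}
            {cnt++; prev=curr}')
    if [ -n "$results" ]
    then
        # strip the leading line number before reporting
        printf '%s\n' "$results" | cut -d' ' -f2-
        # NR is relative to the tail output, so convert it back to an
        # absolute line number for the next tail -n +N
        lastNR=$(printf '%s\n' "$results" | tail -n 1 | cut -d' ' -f1)
        startLineNr=$(( startLineNr + lastNR - 1 ))
    fi
}

# In production this would be driven by cron or a loop, e.g.:
#   while :; do count_new; sleep 60; done
```

Note that the line printed by awk is the first line of the new (possibly incomplete) minute, which is exactly where the next pass should resume.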
btw, you didn't show us this in your sample input, but if your log file contains lines that do not contain REJECT and you want those ignored, just add $NF!="REJECT"{next} at the start of the awk script.
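For example, a minimal sketch with one non-REJECT line mixed in (sample ACCEPT data invented here to show the filter):

```shell
# The leading $NF!="REJECT"{next} rule drops non-REJECT lines before counting.
printf '%s\n' \
  '20170327-09:15:01.1 ResponseType:ACCEPT' \
  '20170327-09:15:01.2 ResponseType:REJECT' \
  '20170327-09:16:02.1 ResponseType:REJECT' \
  '20170327-09:16:03.1 ResponseType:REJECT' |
awk -F'[:-]' '$NF!="REJECT"{next}
    {curr=$2":"$3}
    (prev!="") && (curr!=prev){print prev, $NF, cnt; cnt=0}
    {cnt++; prev=curr}'
# 09:15 REJECT 1
```

The ACCEPT line is skipped, so 09:15 counts only one entry, and the still-open 09:16 minute is not printed (no END block), as in the answer above.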
Upvotes: 2
Reputation: 85895
You can do this in Awk by hashing the minute value as the index and assuming the status does not change within a minute, something like below:
awk -F'[-:]' '{unique[$2":"$3]++; uniqueValue[$2":"$3]=$NF; next}END{for (i in unique) print i,uniqueValue[i],unique[i]}' file
09:15 REJECT 7
09:16 REJECT 2
09:17 REJECT 2
Upvotes: 1