Reputation: 39
I need some help with transforming broken logs that I receive from my admins (unfortunately they are not going to fix them). Basically, the log sometimes contains a duplicated first column, and I want to remove the duplicate whose other column has a value equal to NULL.
This is pretty much how it looks:
datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-07-15 09:34:52,www,PASS,111.222.333.444,NULL,[email protected]
2016-08-14 00:51:35,www,PASS,NULL,19509,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
As you can see, some lines have a duplicated first column (the datetime).
What I'm trying to do is:
1: For all lines whose first column is duplicated, compare column 4 or 5 (IP or port)
2: Delete the duplicated line that has a NULL value in column 4 or 5
This should be the correct output:
datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
I hope that sounds clear.
Upvotes: 0
Views: 130
Reputation: 84561
You can do it in a single pass with a short bash script using arrays. (Depending on the length of the file, you may find a double pass with awk faster, but for larger files this version may be advantageous.) The script simply reads each line, but defers printing it until it has read the next one and compared the date/times to determine whether there is a possible duplicate entry. If the date/times are equal, it then checks the IP field; if the IP is NULL, it skips printing that line.
This is just one approach of many. Give it a try with your data.
Edit: a test for NULL in the port field was added as well.
#!/bin/bash

fn="${1:-/dev/stdin}"                  ## read from the file given as $1, or stdin
prevln=""                              ## previously read (held) line
declare -a prev=( ' ' ' ' ' ' ' ' ' ' ' ' ' ' )   ## fields of the held line
declare -a curr                        ## fields of the current line
declare -i skipdup=0                   ## flag: held line is a duplicate with NULL port

IFS=$' ,\t\n'                          ## add ',' to the internal field separator

while read -r line || test -n "$line"; do        ## read each line
    curr=( $line )                     ## split the current line into fields
    ## compare held date/time with current date/time
    if [ "${prev[0]}" = "${curr[0]}" ] && [ "${prev[1]}" = "${curr[1]}" ]; then
        if [ "${prev[4]}" != "NULL" ]; then      ## if held IP != NULL, print it,
            echo "$prevln"                       ## otherwise skip printing it
        fi
        ## flag the current line if its port is NULL (it is the duplicate to drop)
        [ "${curr[5]}" = "NULL" ] && skipdup=1 || skipdup=0
    else                               ## date/time differ
        if [ "$skipdup" -eq 0 ] && [ -n "$prevln" ]; then
            echo "$prevln"             ## print held line unless it was flagged
        fi
        skipdup=0                      ## reset the flag for the new held line
    fi
    prevln="$line"                     ## hold the current line
    prev=( "${curr[@]}" )              ## hold the current fields
done <"$fn"

## print the last held line unless it was flagged as a NULL duplicate
[ "$skipdup" -eq 0 ] && [ -n "$prevln" ] && echo "$prevln"
Input File
$ cat dat/log.txt
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-07-15 09:34:52,www,PASS,111.222.333.444,NULL,[email protected]
2016-08-14 00:51:35,www,PASS,NULL,19509,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
Example Use/Output
$ bash logdups.sh <dat/log.txt
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
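For comparison, here is a rough single-pass awk sketch of the same hold-and-compare idea. It is only a sketch under a few assumptions: the header line is present as in the question's sample, column 4 is the client IP and column 5 is the port, and log.txt is a placeholder filename.
awk -F, '
function emit() { if (held != "" && !skip) print held }   # print the held line unless flagged
NR == 1 { print; next }                                   # pass the header through untouched
$1 == helddate {                                          # same datetime as the held line
    if (heldip != "NULL" && heldport != "NULL")
        print held                                        # keep the duplicate without NULL
    held = $0; heldip = $4; heldport = $5                 # the current line becomes the held one
    skip = ($4 == "NULL" || $5 == "NULL")                 # and is dropped later if it has a NULL
    next
}
{ emit(); held = $0; helddate = $1; heldip = $4; heldport = $5; skip = 0 }
END { emit() }                                            # flush the last held line
' log.txt
As in the bash version, each line is held until the next one shows whether it belongs to a duplicate pair.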
Upvotes: 1
Reputation: 47099
By doing two iterations over the log file, you can collect all of the used dates in the first iteration, and then, in the second iteration, remove the NULL rows whose date is already used:
$ awk -F, '$4$5 !~ /NULL/ {d[$1]=1}
NR == FNR { next }
!d[$1] || $4$5 !~ /NULL/' input.log input.log
In the first iteration d gets populated with the dates used by rows that don't contain NULL in the specified fields. NR == FNR is only true for the first file iterated over. In the second iteration a line is printed if it doesn't contain NULL in column 4 or 5, or if d doesn't contain its date.
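For reference, assuming the sample from the question (including the header line) is saved as input.log, the command above should print:
datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]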
Upvotes: 2
Reputation: 11216
You could use this awk.
It iterates over the file twice: the first pass counts the dupes, and the second prints the lines that aren't dupes or don't contain NULL. (file{,} is brace expansion that simply passes the same file name twice.)
awk -F, 'NR==FNR{a[$1]++;next}a[$1]<2||$4$5!~/NULL/' file{,}
datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
Upvotes: 3