Wyjun

Reputation: 39

Remove duplicated lines

I need some help transforming broken logs that I receive from my admins (unfortunately, they are not going to fix the logging). Sometimes the log contains lines with a duplicated first column; I want to remove such a line if one of its other columns has a value equal to NULL.

This is roughly what the log looks like:

datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-07-15 09:34:52,www,PASS,111.222.333.444,NULL,[email protected]
2016-08-14 00:51:35,www,PASS,NULL,19509,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]

As you can see, some lines have a duplicated first column (the datetime).

What I'm trying to do is:

1: For every line whose first column is duplicated, compare column 4 or 5 (IP or port)

2: Delete the duplicated line that has a NULL value in column 4 or 5

This should be the correct output:

datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]

I hope that's clear.

Upvotes: 0

Views: 130

Answers (3)

David C. Rankin

Reputation: 84561

You can do it in a single pass with a short bash script using arrays. (Depending on the length of the file you may find a double pass with awk faster, but for larger files this version may be advantageous.) The script simply reads all lines, but defers printing each line until it has read the next one and compared the date/times, to determine whether there is a possible duplicate entry. If the date/times are equal, it then checks the IP field; if the IP is NULL, it skips printing that line.

This is just one approach of many. Give it a try with your data:

Edit: a test was added for NULL in the id (port) field.

#!/bin/bash

fn="${1:-/dev/stdin}"   ## read file or stdin

prevln=""   ## previous line & prev and curr arrays of values
declare -a prev=( ' ' ' ' ' ' ' ' ' ' ' ' ' ' )
declare -a curr
declare -i iddup=0  ## flag marking duplicate in 'id' field

IFS=$' ,\t\n'   ## add ',' to internal field separator

while read -r line || test -n "$line"; do   ## read each line
    curr=( $line )          ## split line into fields (IFS handles ',')
    ## test prev date/time with curr date/time
    if [ "${prev[0]}" = "${curr[0]}" -a "${prev[1]}" = "${curr[1]}" ]; then
        if [ "${prev[4]}" != "NULL" ]; then     ## if prev IP != NULL print
            echo "$prevln"                      ## otherwise, skip print
        fi
        [ "${curr[5]}" = "NULL" ] && iddup=1 || iddup=0  ## flag id dup
    elif [ "$iddup" -eq '0' ]; then   ## if date/time inequal, print line
        [ -n "$prevln" ] && echo "$prevln"
    fi
    prevln="$line"      ## assign line to prevln
    prev=( "${curr[@]}" )   ## assign curr to prev array
done <"$fn"

## decide the final deferred line after the loop exits
## ('line' is empty here, so no date/time comparison is needed)
[ "$iddup" -eq 0 ] && [ -n "$prevln" ] && echo "$prevln"

Input File

$ cat dat/log.txt
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-07-15 09:34:52,www,PASS,111.222.333.444,NULL,[email protected]
2016-08-14 00:51:35,www,PASS,NULL,19509,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]

Example Use/Output

$ bash logdups.sh <dat/log.txt
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
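Since fn defaults to /dev/stdin only when no argument is given, the same script also accepts the filename directly:

$ bash logdups.sh dat/log.txt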

Upvotes: 1

Andreas Louv

Reputation: 47099

By making two passes over the log file, you can collect all used dates in the first pass, then in the second pass remove the NULL rows whose date is already used:

$ awk -F, '$4$5 !~ /NULL/ {d[$1]=1}
           NR == FNR { next }
           !d[$1] || $4$5 !~ /NULL/' input.log input.log

In the first pass, d is populated with the dates of rows that don't contain NULL in the specified fields.

NR == FNR is only true while the first file is being read.

In the second pass, a line is printed if it contains no NULL in column 4 or 5, or if d doesn't contain its date.
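For readability, the same two-pass program can be written out with comments; this is functionally identical to the one-liner above:

awk -F, '
    ## remember dates seen on rows whose client and port fields
    ## (4 and 5) are both non-NULL
    $4 $5 !~ /NULL/ { d[$1] = 1 }

    ## NR == FNR only holds while the first copy of the file is
    ## read, so the print rule below is skipped on the first pass
    NR == FNR { next }

    ## second pass: print rows with no NULL in fields 4/5, or
    ## whose date never appeared on a complete row
    !d[$1] || $4 $5 !~ /NULL/
' input.log input.log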

Upvotes: 2

123

Reputation: 11216

You could use this awk.

Iterate over the file twice.
Count the dupes the first time.
Print lines that aren't dupes, or that don't contain NULL, the second time.

awk -F, 'NR==FNR{a[$1]++;next}a[$1]<2||$4$5!~/NULL/' file{,}

datetime,auth_protocol,result,client,port,login
2016-07-15 09:34:52,www,PASS,111.222.333.444,5593,[email protected]
2016-08-14 00:51:35,www,PASS,444.333.222.111,19509,[email protected]
2016-09-16 05:08:46,www,FAIL,123.456.789.222,45673,[email protected]
2016-09-16 15:35:55,www,FAIL,123.456.789.222,NULL,[email protected]
2016-09-17 17:10:01,www,FAIL,111.222.333.444,55516,[email protected]
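The file{,} at the end is plain bash brace expansion: it passes the same filename twice, which is what gives awk its two passes over the data:

$ echo file{,}
file file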

Upvotes: 3
