ZacEsa

Reputation: 11

Bash script to filter out non-adjacent duplicates in logs

I'm trying to create a script to filter out duplicates in my logs and keep the latest of each message. A sample is below:

May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
May 29 22:25:19 servername.com Fdm: this is just a message
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The logs are split between two files. I've already created the script to merge the two files and sort them by date using sort -s -r -k1.

I've also managed to make the script ask for the date I want and then use grep to filter by that date.
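Roughly, what I have so far for those steps looks like this (file names are just placeholders):

cat log1.txt log2.txt | sort -s -r -k1 > merged.log
read -p "Date to filter (e.g. May 29): " logdate
grep -e "$logdate" merged.log > bydate.log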

Right now, I only need to find a way to remove the non-adjacent duplicate lines, which also have different timestamps. I tried awk, but my knowledge of awk isn't that great. Any awk gurus out there able to assist me?

P.S. One of the issues I'm encountering is that there are identical lines with different error codes. I want to remove those duplicates, but so far I can only do it with grep -v "Constant part of line". If there's a way to remove duplicates by percentage of similarity, that would be great. Also, I can't just make the script ignore certain fields or columns, because the error codes appear at different fields/columns in different lines.

Expected output is below:

May 29 22:25:30 servername.com Fdm: another error message 3 76543
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890

I only want the errors, but that's easily done with grep -i error. The only remaining issue is the duplicate lines with different error codes.

Upvotes: 1

Views: 782

Answers (5)

ZacEsa

Reputation: 11

I managed to find a way to do it. Just to give you more detail about the issue I had and what this script does:

Issue: I had logs to clear, but they contain many lines with repeating errors. Unfortunately, the repeating errors have different error codes, so I'm not able to just grep -v them. Plus, the logs have tens of thousands of lines, so repeatedly "grep -v"-ing them would take a lot of time. I've decided to semi-automate it with the script below. If you have ideas on how to improve it, please do comment!

#!/usr/local/bin/bash

# Start from a clean slate
rm /tmp/tmp.log /tmp/tmpfiltered.log 2> /dev/null

printf "Please key in full location of logs: "
read log1loc log2loc

# Merge both logs and sort them newest-first
cat "$log1loc" "$log2loc" >> /tmp/tmp.log
sort -s -r -k1 /tmp/tmp.log -o /tmp/tmp.log

printf "Please key in the date: "
read logdate

while [[ $firstlineedit != "n" ]]
do
        # Show the remaining error lines for the chosen date
        grep -e "$logdate" /tmp/tmp.log | grep -i error | less

        # Keep one copy of the top line, then let the user edit it down
        # to the constant part to remove (enter n to stop)
        firstline=$(head -n 1 /tmp/tmp.log)
        head -n 1 /tmp/tmp.log >> /tmp/tmpfiltered.log
        read -p "Enter line to remove (enter n to quit): " -e -i "$firstline" firstlineedit

        # Count, then drop, every line matching the edited pattern
        firstlinecount=$(grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -o "$firstlineedit" | wc -l)
        grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -v "$firstlineedit" > /tmp/tmp2.log
        mv /tmp/tmp2.log /tmp/tmp.log

        if [ "$firstlineedit" != "n" ]; then
                echo "That line and its variations have appeared $firstlinecount times in the log!"
        fi
done

# Review the lines that were kept
cat /tmp/tmpfiltered.log | less

Upvotes: 0

Ed Morton

Reputation: 203597

You don't tell us how you define a "duplicate", but if you mean messages with the same timestamp then this will do it:

$ tac file | awk '!seen[$1,$2,$3]++' | tac
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: another error message 3 76543

If that's not what you mean, then just change the indices used in the awk array to whatever you do want to consider for the duplicate test.
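For example, to treat lines as duplicates whenever everything after the timestamp matches, you could build the key from fields 4 through NF instead (just an illustration):

$ tac file | awk '{k=""; for (i=4; i<=NF; i++) k=k OFS $i} !seen[k]++' | tac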

Given your recent comments maybe this is what you want:

$ tac file | awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k)} !seen[k]++' | tac
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The above works by creating a key value, k, which is the part of each line after the first : that isn't part of the time field, with every sequence of digits changed to #:

$ awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k); print $0 ORS "\t -> key =", k}' file
May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
         -> key =  this is error message # error code=#x#
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
         -> key =  error code=# message #
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
         -> key =  this is error message # error code=#x#
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
         -> key =  error code=# message #
May 29 22:25:30 servername.com Fdm: another error message 3 76543
         -> key =  another error message # #

Upvotes: 0

Jeffrey Cash

Reputation: 1073

You could always skip the first 3 fields and remove duplicates using sort -suk4. The first 3 fields make up the date string, so any two lines with identical text after that are collapsed into one. Then you can sort the result however you want for output:

sort -suk4 filename | sort -rs

Getting rid of lines with differing error codes would be trickier, but I would recommend isolating the lines with error codes into their own file and then using something like the following:

sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/' errorfile | sort -suk5 | sed 's/\([0-9]*\) \(.*error code=\)/\2\1/'
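To illustrate what the first sed does to a single line (my own example, not from the question), it moves the numeric code to the front so the sort key starting at field 5 no longer includes it; the second sed then puts it back:

$ echo 'May 29 22:25:19 servername.com Fdm: error code=12345 message 2' | sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/'
12345 May 29 22:25:19 servername.com Fdm: error code= message 2

Note that hex codes like 0x98765 are only partially matched by [0-9]*, so the pattern would need extending to handle those.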

Upvotes: 1

zetavolt

Reputation: 3207

To remove identical lines with differing timestamps, you can simply check for duplicates after the 15th character.

awk '!duplicates[substr($0,15)]++' "$filename"

If your logs are tab-delimited, you can be even more precise and select which columns you want to determine duplicates from, which is a better solution than trying to compute a Levenshtein distance between lines.
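For instance, if the message were in the 5th tab-separated column (hypothetical layout), something like this would dedupe on that column alone:

awk -F'\t' '!duplicates[$5]++' "$filename"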

Upvotes: 1

heemayl

Reputation: 42017

You can do this with sort alone.

Just operate on the fields starting from the 4th to detect duplicates:

sort -uk4 file.txt

This will give you the first entry of each set of dupes; if you want the last one, use tac beforehand:

tac file.txt | sort -uk4 

Example:

$ cat file.txt      
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores

$ sort -uk4 file.txt
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList

$ tac file.txt | sort -uk4         
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList

Upvotes: 2
