user6329667

Reputation: 517

remove duplicate lines in log file

I am trying to clean up duplicate lines in my log file. First of all I used the sort command with the uniq -d flag; it helped me find duplicates, but did not solve my problem.

sort pnum.log | uniq -d

Contents of pnum.log:

PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg   
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123  ssd
PNUM-1235: [App] [Tracker] Text 1dbg  
PNUM-1234: [App] [Tracker] Text 123 ssd vp

The sort command removes duplicates, but unfortunately I also need to remove lines with repeated PNUMs and keep only one line per PNUM, the one with the longest text. In the example output that would be “PNUM-1234: [App] [Tracker] Text 123 ssd vp”, and the two other lines with PNUM-1234 should be removed from the file. How can this be achieved? Is there any Linux command, like sort, which could help me?

And the expected output would be:

PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg   
PNUM-1235: [App] [Tracker] Text 1dbg  
PNUM-1234: [App] [Tracker] Text 123 ssd vp

Upvotes: 0

Views: 2626

Answers (5)

Ed Morton

Reputation: 204731

sort | uniq -d doesn't remove duplicates; it prints one line from each batch of lines that are duplicates. You should probably be using sort -u instead - that will remove duplicates.
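
A minimal illustration of the difference:

$ printf 'a\na\nb\n' | sort | uniq -d
a
$ printf 'a\na\nb\n' | sort -u
a
b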

But to answer the question you asked:

$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text

The first awk command just prepends each line with its length so that the subsequent sort can order all of the lines longest-first. The second awk outputs a line only on the first occurrence of its key field value (which, after the sort, is the longest line with that key value), and then the cut removes the line length that the first awk added.

In sequence:

$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123  ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123  ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text

You didn't say which line to print if multiple lines for the same key value are the same length, so the above will just output one of them arbitrarily. If that's an issue then you can use GNU sort and add the -s argument (for stable sort), or change the command line to:

awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3-

In both cases that ensures the line output in such a conflict is the first one that was present in the input.
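
A quick check of that tie-breaking behaviour, using two same-length lines for one key:

$ printf 'PNUM-1: aa\nPNUM-1: bb\n' | awk '{print length($0), NR, $0}' | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3-
PNUM-1: aa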

Upvotes: 2

Du-Lacoste

Reputation: 12827

You could use the following; for debugging purposes you may also use the extra commands listed. It may help anyone searching.

sort -u pnum.log #this will remove the duplicates

uniq -c pnum.log #this will print the count (input must already be sorted)

uniq -u pnum.log #prints only the lines that are not repeated (input must already be sorted)
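
A small illustration of why uniq needs sorted input (it only compares adjacent lines):

$ printf 'b\na\nb\n' | uniq -c
      1 b
      1 a
      1 b
$ printf 'b\na\nb\n' | sort | uniq -c
      1 a
      2 b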

Upvotes: -1

LeadingEdger

Reputation: 724

This pipeline should remove the lines with duplicate/repeated PNUMs from your pnum.log file, keep only the longest line for each PNUM, and maintain the lines' relative order:

awk -F":" '
# keep the longest line seen so far for each PNUM key, tagged with its line number
{ if (length($0) > max[$1]) { max[$1] = length($0); line[$1] = NR":"$0 } }
# print the survivors; the line-number prefix lets sort restore the original order
END { for (key in line) { print line[key] } }
' pnum.log | sort -t: -nk1 | cut -d: -f2-
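
On the sample input above this should print, in the original relative order:

PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp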

Upvotes: 1

KamilCuk

Reputation: 142080

Because the first field seems to have a constant width (10 characters including the colon), you could compare only those leading characters:

uniq -w 10 file
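
Note that -w is a GNU uniq extension, and uniq only compares adjacent lines and keeps the first line of each group (not the longest), so with unsorted input you would typically sort first:

sort file | uniq -w 10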

Upvotes: 2

Rdit

Reputation: 1562

Assuming that you have already removed the duplicated rows, you could use the following awk statement to print only one row per first-column key (PNUM-XXXX), selecting the longest one.

awk -F":" '{ if (length($0)> length(to_print[$1])) {to_print[$1]=$0} } END { for (key in to_print) { print to_print[key] } }'

So, you create an array to_print that keeps track of the longest row for each key, and at the end it prints that array.
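
One caveat: awk's for (key in to_print) loop visits keys in an unspecified order, so on the sample input this prints one line per PNUM but not necessarily in the original order, e.g. (one possible ordering):

$ awk -F":" '{ if (length($0) > length(to_print[$1])) { to_print[$1]=$0 } } END { for (key in to_print) { print to_print[key] } }' pnum.log
PNUM-1233: [App] [Tracker] Text
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1236: [App] [Tracker] Text ddfg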

Upvotes: 2
