Reputation: 517
I am trying to clean up duplicate lines in my log file. First of all I used the sort command with the uniq -d flag; it helped me to remove duplicates, but it did not solve my problem.
sort pnum.log | uniq -d
Output of the sort command.
PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123 ssd
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
The sort command removes duplicates, but unfortunately I also need to remove lines with repeated PNUMs and keep only one line per PNUM, the one with the longest text. In the example output that would be "PNUM-1234: [App] [Tracker] Text 123 ssd vp", and any other lines with PNUM-1234 should be removed from the file. How can this be achieved? Is there any Linux command, like sort, which could help me with this?
The expected result would be:
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
Upvotes: 0
Views: 2626
Reputation: 204731
sort | uniq -d doesn't remove duplicates; it prints one copy of each batch of lines that are duplicated. You should probably be using sort -u instead - that will remove duplicates.
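For example, here is a minimal sketch against the sample pnum.log from the question:
$ sort pnum.log | uniq -d    # prints each duplicated line once
$ sort -u pnum.log           # keeps one copy of every distinct line
Note that sort -u would still leave both PNUM-1234 lines, because their text differs.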
But to answer the question you asked:
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
The first awk command just prepends each line with its length so the subsequent sort can sort all of the lines longest-first, then the 2nd awk only outputs a line when it's the first occurrence of the key field value (which is now the longest line with that key value), and then the cut removes the line length that the first awk added.
In sequence:
$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
You didn't say which line to print if multiple lines for the same key value are the same length, so the above will just output one of them at random. If that's an issue then you can use GNU sort and add the -s argument (for stable sort), or change the command line to awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3-. In both cases that would ensure the line output in such a conflict is the first one that was present in the input.
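Spelled out, the two tie-breaking variants described above would be (the first form assumes GNU sort for -s):
$ awk '{print length($0), $0}' file | sort -s -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
$ awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3-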
Upvotes: 2
Reputation: 12827
You could use the following; for debugging purposes you may also use the extra commands listed. It may help anyone searching.
sort -u pnum.log #this will remove the duplicates
uniq -c pnum.log #this will print the count
uniq -u pnum.log #prints only unique lines of the log
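One caveat worth noting as a small usage sketch: uniq only collapses adjacent duplicates, so the input usually needs to be sorted first:
sort pnum.log | uniq -c #count occurrences of each distinct line
sort pnum.log | uniq -u #print only the lines that occur exactly once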
Upvotes: -1
Reputation: 724
This command should be able to remove lines with duplicate/repeated PNUMs from your pnum.log file, keep only the line with the longest text for each unique PNUM, and maintain their relative line order:
cat pnum.log | awk -F":" '
# keep the longest line per PNUM, prefixed with its input line number (NR)
# so the final sort can restore the original order and cut can strip the prefix
{ if (length($0) > len[$1]) { len[$1] = length($0); line[$1] = NR":"$0 } }
END { for (key in line) { print line[key] } }
' | sort -t: -nk1 | cut -d: -f2-
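On the sample input from the question this should print the surviving lines in their original relative order, e.g.:
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp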
Upvotes: 1
Reputation: 142080
Because the first field seems to have a constant number of characters, you could:
uniq -w 10 file
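A small usage note, assuming GNU uniq (where -w/--check-chars=N compares only the first N characters): duplicates must be adjacent, so you would typically sort first, and this keeps the first line of each group rather than the longest one:
sort pnum.log | uniq -w 10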
Upvotes: 2
Reputation: 1562
Assuming that you have already removed the duplicated rows, you could use the following awk statement to print only unique rows based on the first column (PNUM-XXXX), selecting the longest one.
awk -F":" '{ if (length($0)> length(to_print[$1])) {to_print[$1]=$0} } END { for (key in to_print) { print to_print[key] } }'
So, it builds an array to_print that keeps track of the longest row for each key, and at the end it prints that array.
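For example, run against the de-duplicated file (note that for (key in to_print) iterates in an unspecified order, so the output lines may come out in any order):
awk -F":" '{ if (length($0)> length(to_print[$1])) {to_print[$1]=$0} } END { for (key in to_print) { print to_print[key] } }' pnum.log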
Upvotes: 2