K.Thanvi

Reputation: 63

sort and remove duplicates based on different columns in a file

I have a file in which there are three columns (yyyy-mm-dd hh:mm:ss.000 followed by a number):

2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891

I want to first sort the file based on the date-time (the first two columns) and then remove the rows that have duplicate numbers (third column), keeping only the earliest entry for each number.
So after this the above file will look like:

2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890

I have used sort with a key and an awk command (as below), but the results aren't correct. (I am not very sure which entries are being removed, as the files I am processing are too big.)
Commands:

sort -k1 inputFile > sortedInputFile
awk '!seen[$3]++' sortedInputFile > outputFile

I am not sure how to do this.

Upvotes: 1

Views: 3063

Answers (3)

rici

Reputation: 241931

If you want to keep the earliest instance of each 3rd column entry, you can sort twice; the first time to group duplicates and the second time to restore the sort by time, after duplicates are removed. (The following assumes a default sort works with both dates and values and that all lines have three columns with consistent whitespace.)

sort -k3 -k1,2 inputFile | uniq -f2 | sort > sortedFile

The -f2 option tells uniq to skip the first two fields when comparing lines, so the date and time are not considered and only the number field determines whether adjacent lines are duplicates.
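For illustration, here is roughly what each stage produces on the sample input from the question (using inputFile as above; exact collation can depend on locale, but for this data the result should be the same):

$ sort -k3 -k1,2 inputFile
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891
$ sort -k3 -k1,2 inputFile | uniq -f2
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891
$ sort -k3 -k1,2 inputFile | uniq -f2 | sort
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890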

Upvotes: 1

James Brown

Reputation: 37464

Here is one in awk. It groups on $3 and stores the earliest timestamp for each, but the output order is random, so the output should be piped to sort.

$ awk '
    # keep the smallest (earliest) "$1 $2" timestamp seen for each $3
    a[$3] == "" || a[$3] > ($1 OFS $2) { a[$3] = $1 OFS $2 }
    END { for (i in a) print a[i], i }
' file # | sort goes here
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891
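
Adding the sort that the comment points to restores the time order; on the question's sample input this pipeline should produce:

$ awk '
    a[$3] == "" || a[$3] > ($1 OFS $2) { a[$3] = $1 OFS $2 }
    END { for (i in a) print a[i], i }
' file | sort
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890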

Upvotes: 0

bprasanna

Reputation: 2453

If milliseconds don't matter, here is another approach that removes them and then performs the sort and uniq:

awk '{print $1" "substr($2,1,index($2,".")-1)" "$3 }' file1.txt | sort | uniq
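
On the question's sample data (saved as file1.txt here), this should produce something like the following; note that the milliseconds are stripped from the output as well:

2016-11-30 23:40:45 5001234567890
2016-11-30 23:40:45 5001234567891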

Upvotes: 1
