Reputation: 111
My previous question (here) just got a bit more complicated.
I need to identify duplicate entries in a comma-separated text file, based on the combination of columns 4 and 5, which represent an identifier (column 4) and a date (column 5). Those two columns can simply be concatenated.
If there are duplicates, I need to perform a check based on a substring of another column (column 14), which represents an int value. The duplicate with the higher value should be kept.
Example for extracting the substring: cut -d"," -f14 file.txt | grep -oE "N0[0-9]{3}" | sed 's/N0//g'
If the int values are the same, I need to perform yet another check based on column 9, which represents a date. The row with the later date should be kept.
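To make the comparison inputs concrete, here is a minimal awk sketch (file name file.txt assumed; it also assumes the N0 pattern is always present in column 14) that prints the col4+col5 key and the col-14 int for every row:

awk -F, '{
    match($14, /N0[0-9][0-9][0-9]/)              # locate N0### in column 14
    print $4 "_" $5, substr($14, RSTART + 2, 3)  # dedupe key and 3-digit int
}' file.txt

For the example data below, every row prints the same key, X08ABC_20200804, which is exactly what marks the rows as duplicates.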
Example data:
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-11T09:45:12.000000Z,xyz,xyz,xyz,xyz,long_string_N0299_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-17T20:08:27.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-09-25T16:18:43.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
In the example, the third row would be kept while the first two are discarded. All three are duplicates (col4+col5 is the same for each), the int value in col 14 is highest in rows 2 and 3, and the date in col 9 is later in row 3 than in row 2.
If I can keep the order of rows in the original file, that is a plus - but not a must.
Edit: Having an awk solution would be great, if this can be accomplished in a somewhat readable fashion in awk. I assume that's possible, but I get very confused with multiple operations within awk.
Upvotes: 0
Views: 818
Reputation: 111
It took me quite some time reading through other questions, manuals, etc., but I think this should be a possible solution. It does need sorting, though, so the order of rows unfortunately doesn't stay the same. Also, it relies on the column entries always having the same length, which has a lot of potential for trouble.
sort -t, -k14.15,14.17r -k9.1,9.4r -k9.6,9.7r -k9.9,9.10r file.txt | awk -F, 'BEGIN{OFS=","} {print $0, $4"_"$5}' | awk -F, '!a[$15]++'
This sorts the rows in descending order by the int value extracted from column 14 (the 809 part) and by the date in column 9 (year, then month, then day). Then a new column combining col 4 and col 5 is appended, to identify duplicates. The last awk command keeps only the first line for each duplicate key - which is also the reason why everything needed to be sorted first.
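As an aside, because ISO 8601 timestamps also sort chronologically as plain strings, the three positional date keys could (an untested variant) be collapsed into one reversed key spanning all of column 9, which additionally takes the time of day into account:

sort -t, -k14.15,14.17r -k9,9r file.txt | awk -F, 'BEGIN{OFS=","} {print $0, $4"_"$5}' | awk -F, '!a[$15]++'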
Improvements are welcome!
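One candidate improvement, sketched here without testing: a single awk pass that drops the sort step entirely and preserves the original row order, printing each kept row at the position where its key first appeared (it assumes the N0 pattern always appears in column 14):

awk -F, '
{
    key = $4 "_" $5                               # duplicate key: col4 + col5
    match($14, /N0[0-9][0-9][0-9]/)               # find N0### in column 14
    val = substr($14, RSTART + 2, 3) + 0          # its 3-digit int, as a number
    if (!(key in best)) {
        order[++n] = key                          # remember first-seen key order
        best[key] = val; when[key] = $9; line[key] = $0
    } else if (val > best[key] || (val == best[key] && $9 > when[key])) {
        best[key] = val; when[key] = $9; line[key] = $0   # better duplicate wins
    }
}
END {
    for (i = 1; i <= n; i++)
        print line[order[i]]                      # emit winners in original order
}' file.txt

The $9 > when[key] test works as a plain string comparison because the ISO 8601 timestamps in column 9 sort chronologically as text.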
Upvotes: 1