Reputation: 5
I'm back again with another awk
question.
I have multiple large files that contain data i want to dedupe against each other. Lets say I have the following data for one month:
fruit number rand
apple 12 342
taco 19 264
tortilla 2234 53423
tortillas 2 3431
apricot 13221 23424
apricots 24234 3252
pineapple 2342 2342
radish 1 3
The following month I receive this data:
fruit number rand
pineapple 2 698
apple 34 472
taco 19 234
tortilla 16 58
tortillas 87 25
potato 234 2342
radish 1 55
grapes 9 572 422
apricot 13221 24
What I am trying to do is take the second file, and check the values of the first column to see if there are items that exist in the first file. If yes, I want to remove them from the second file, leaving only items that are unique to the second file with relation to the first one.
The desired outcome would leave me something like this:
fruit number rand DUPLICATE
pineapple 2 698 DUPE
apple 34 472 DUPE
taco 19 234 DUPE
tortilla 16 58 DUPE
tortillas 87 25 DUPE
potato 234 2342
radish 1 55 DUPE
grapes 9 572 422
apricot 13221 24 DUPE
Or, more clearly:
fruit number rand
potato 234 2342
grapes 9 572 422
I was trying to think of a way to do this without having to sort the files. I was trying to modify the answer from @karafka for a related question. Instead of passing the same file twice, I tried inputting the two different files. Obviously I'm doing something wrong.
awk 'BEGIN { FS = OFS = " " }
NR==FNR {a[$1]++; next}
FNR==1 {print $0, "DUPLICATE"; next}
$1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}
I'm still learning awk, any help the community can provide is greatly appreciated, but I'll try to explain what I think the above program does.
In the test files I keep getting everything marked as "DUPE" or nothing at all.
I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.
What am I doing wrong?
Upvotes: 0
Views: 140
Reputation: 204721
I think what you're doing wrong is just trying to use a bunch of scripts that are unrelated to your current problem as your starting point.
It sounds like all you need is:
$ awk '
NR==FNR { file1[$1]; next }
FNR==1 || !($1 in file1)
' file1 file2
fruit number rand
potato 234 2342
grapes 9 572 422
Upvotes: 2