mikelcal
mikelcal

Reputation: 5

Remove all instances of the same column value across multiple files using awk

I'm back again with another awk question.

I have multiple large files that contain data i want to dedupe against each other. Lets say I have the following data for one month:

fruit   number  rand
apple   12  342
taco    19  264
tortilla    2234    53423
tortillas   2   3431
apricot 13221   23424
apricots    24234   3252
pineapple   2342    2342
radish  1   3

The following month I receive this data:

fruit   number  rand
pineapple   2   698
apple   34  472
taco    19  234
tortilla    16  58
tortillas   87  25
potato  234 2342
radish  1   55
grapes  9   572 422
apricot 13221   24

What I am trying to do is take the second file, and check the values of the first column to see if there are items that exist in the first file. If yes, I want to remove them from the second file, leaving only items that are unique to the second file with relation to the first one.

The desired outcome would leave me something like this:

fruit   number  rand    DUPLICATE
pineapple   2   698     DUPE
apple   34  472         DUPE
taco    19  234         DUPE
tortilla    16  58      DUPE
tortillas   87  25      DUPE
potato  234 2342
radish  1   55          DUPE
grapes  9   572 422
apricot 13221   24      DUPE

Or, more clearly:

fruit   number  rand
potato  234 2342
grapes  9   572 422

I was trying to think of a way to do this without having to sort the files. I was trying to modify the answer from @karafka for a related question. Instead of passing the same file twice, I tried inputting the two different files. Obviously I'm doing something wrong.

awk 'BEGIN { FS = OFS = "      " }
    NR==FNR {a[$1]++; next}
    FNR==1  {print $0, "DUPLICATE"; next}
    $1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}

I'm still learning awk, any help the community can provide is greatly appreciated, but I'll try to explain what I think the above program does.

  1. The first line sets the delimiter and the output delimiter to be a tab character.
  2. This line reads the first file and stores an array with a count of how many times an item appears in the list.
  3. This outputs the first line which is essentially the header, adding "DUPLICATE" at the end of the last item in the row
  4. (This is where I'm stuck) If the current value is found in the array "a" it should check if the stored value is greater than one. If yes, it should print the word "DUPE" in the last column. Finally it returns the entire line.

In the test files I keep getting everything marked as "DUPE" or nothing at all.

I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.

What am I doing wrong?

Upvotes: 0

Views: 140

Answers (1)

Ed Morton
Ed Morton

Reputation: 204721

I think what you're doing wrong is just trying to use a bunch of scripts that are unrelated to your current problem as your starting point.

It sounds like all you need is:

$ awk '
NR==FNR { file1[$1]; next }
FNR==1 || !($1 in file1)
' file1 file2
fruit   number  rand
potato  234 2342
grapes  9   572 422

Upvotes: 2

Related Questions