Peter.k
Peter.k

Reputation: 1548

How to join data frames by column when key is not unique in R

I can't find similar answer on stack overflow. I have 2 data frames: a:

                    X     match_nr                                                     matches
12               ACADIA.3        1                                 ACADIA Pharmaceuticals Inc.
13               ACADIA.3        2 ACADIA PHARMACEUTICALS STOCK WKN 603035 | ISIN US0042251084
14                 Acorda        1                                    Acorda Therapeutics Inc.
15 Advanced Accelerator.1        1                                                            
16 Advanced Accelerator.2        1                                                            

b:

                       X       id dataset_code database_code
5               ACADIA.3 10973832         ACHC          WIKI
6                 Acorda 10974692         ACOR          WIKI
7 Advanced Accelerator.1 10973902         AXDX          WIKI

How to merge them to have "left.join" by X? I used dplyr::left.join, merge and cbind.fill and in all cases the results data.frame are much bigger than both of them, because X is not unique in a (but is unique in b). All I need is to append missing columns and repeat all values for rows where X=X.

EDIT

I'm not sure I'm clear with my explanations. I can draw what I want to achieve: c:

X               matches name    dataset_code    database_code
ACADIA.3             1  ACADIA Pharmaceuticals  10973832    ACHC
ACADIA.3             2  ACADIA PHARMACEUTICALS  10973832    ACHC
Acorda               1  Acorda Therapeutics     10974692    ACOR
Advanced Accelerator.1  1                       10973902    AXDX
Advanced Accelerator.2  1           

Is it really hard to understand? My closest code is:
cdf <- NULL
for (i in 1:nrow(a)) {
  xdf <- cbind.data.frame(a[i,], b[b$X == a$X[i],])
  cdf <- rbind.data.frame(cdf, xdf)
}

but it missing the last row of a (only 4) - should be exactly the same number as a

Upvotes: 1

Views: 3708

Answers (2)

IRTFM
IRTFM

Reputation: 263301

The 'name' column seems to be a request for something that is not actually in either of your example datasets but if you want a result that delivers most of what you want then just use merge with all.x=TRUE:

merge(a,b,all.x=TRUE)[-c(3,6)]
                       X match_nr       id dataset_code
1               ACADIA.3        1 10973832         ACHC
2               ACADIA.3        2 10973832         ACHC
3                 Acorda        1 10974692         ACOR
4 Advanced Accelerator.1        1 10973902         AXDX
5 Advanced Accelerator.2        1       NA         <NA>

And if either of the columns that were dropped is what you were expecting for name, then remove their number from the exclusion vector.

Upvotes: 1

sconfluentus
sconfluentus

Reputation: 4993

Try an anti_join which is aimed at finding all unique rows, Use this to pull the a values NOT IN B, then rbind those onto B, and you should have what you are looking for:

c <- anti_join(a, b, by = "X")

This should pull the unique values from A and then

df <- rbind(b,c)

to reattach the b values which are already unique...filling in the previously missing A values

Upvotes: 0

Related Questions