merge function causing duplicates when NA is present

Question

Merging gives me a spuriously large dataframe inducing duplicates from NAs, even if the two constituent dataframes have essentially identical content. What I'm after is a merged dataframe where any missing ID gets its own column.

Here's a worked example with two nearly identical data frames, just with NA in different positions. In normal usage, these would be ID columns, with larger data frames associated with them.

df1<-c("LJUL1994I", "GMAY1994J",NA,"WJUN1994A")
df2<-c("LJUL1994I", NA, "GMAY1994J","WJUN1994A")

What I would like is the matching to work like this:

LJUL1994I  LJUL1994I
GMAY1994J  GMAY1994J
WJUN1994A  WJUN1994A

But, what I get is this...

merge(df1,df2)
               x         y
    1  LJUL1994I LJUL1994I
    2  GMAY1994J LJUL1994I
    3        LJUL1994I
    4  WJUN1994A LJUL1994I
    5  LJUL1994I      
    6  GMAY1994J      
    7             
    8  WJUN1994A      
    9  LJUL1994I GMAY1994J
    10 GMAY1994J GMAY1994J
    11       GMAY1994J
    12 WJUN1994A GMAY1994J
    13 LJUL1994I WJUN1994A
    14 GMAY1994J WJUN1994A
    15       WJUN1994A
    16 WJUN1994A WJUN1994A

The same output happens if I fiddle with settings (i.e. all=TRUE, incomparables=NA)

Sorting and cbinding the dataframes is a brittle solution, as I want to extend this to situations where the ID columns differ in length, and may have differing numbers of NAs.

Base r solutions preferred, but I'll take package-based solutions if they're more elegant.

MrFlick · Accepted Answer

The reason your merge look funny is because you are passing in character vectors rather than data.frames. Those character vectors are being coerced to data.frames, but since they have different names, each created data.frame will have different column names so when you merge two data.frames with no overlapping column names, you get a full outer join.

You can use merge here, but merge likes to collapse the shared column rather than duplicate it. Since you know they match. Here's one attempt

df1 <- data.frame(a=c("LJUL1994I", "GMAY1994J",NA,"WJUN1994A"))
df2 <- data.frame(a=c("LJUL1994I", NA, "GMAY1994J","WJUN1994A"))
merge(df1, cbind(df2, b=df2$a), all=T)

#           a         b
# 1 GMAY1994J GMAY1994J
# 2 LJUL1994I LJUL1994I
# 3 WJUN1994A WJUN1994A
# 4            
# 5

Here i doubled the "a" column as "b" in df2 before merging to get two columns.

merge function causing duplicates when NA is present

Answers (1)

Related Questions