Reputation: 95
Merging gives me a spuriously large dataframe inducing duplicates from NAs, even if the two constituent dataframes have essentially identical content. What I'm after is a merged dataframe where any missing ID gets its own column.
Here's a worked example with two nearly identical data frames, just with NA in different positions. In normal usage, these would be ID columns, with larger data frames associated with them.
df1<-c("LJUL1994I", "GMAY1994J",NA,"WJUN1994A")
df2<-c("LJUL1994I", NA, "GMAY1994J","WJUN1994A")
What I would like is the matching to work like this:
LJUL1994I LJUL1994I
GMAY1994J GMAY1994J
WJUN1994A WJUN1994A
<NA>
<NA>
But, what I get is this...
merge(df1,df2)
x y
1 LJUL1994I LJUL1994I
2 GMAY1994J LJUL1994I
3 <NA> LJUL1994I
4 WJUN1994A LJUL1994I
5 LJUL1994I <NA>
6 GMAY1994J <NA>
7 <NA> <NA>
8 WJUN1994A <NA>
9 LJUL1994I GMAY1994J
10 GMAY1994J GMAY1994J
11 <NA> GMAY1994J
12 WJUN1994A GMAY1994J
13 LJUL1994I WJUN1994A
14 GMAY1994J WJUN1994A
15 <NA> WJUN1994A
16 WJUN1994A WJUN1994A
The same output happens if I fiddle with settings (i.e. all=TRUE
, incomparables=NA
)
Sorting and cbinding the dataframes is a brittle solution, as I want to extend this to situations where the ID columns differ in length, and may have differing numbers of NAs.
Base r solutions preferred, but I'll take package-based solutions if they're more elegant.
Upvotes: 0
Views: 378
Reputation: 206197
The reason your merge look funny is because you are passing in character vectors rather than data.frames. Those character vectors are being coerced to data.frames, but since they have different names, each created data.frame will have different column names so when you merge two data.frames with no overlapping column names, you get a full outer join.
You can use merge here, but merge likes to collapse the shared column rather than duplicate it. Since you know they match. Here's one attempt
df1 <- data.frame(a=c("LJUL1994I", "GMAY1994J",NA,"WJUN1994A"))
df2 <- data.frame(a=c("LJUL1994I", NA, "GMAY1994J","WJUN1994A"))
merge(df1, cbind(df2, b=df2$a), all=T)
# a b
# 1 GMAY1994J GMAY1994J
# 2 LJUL1994I LJUL1994I
# 3 WJUN1994A WJUN1994A
# 4 <NA> <NA>
# 5 <NA> <NA>
Here i doubled the "a" column as "b" in df2
before merging to get two columns.
Upvotes: 1