Reputation: 627
I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, while df2 is only a subset of all the ids, but they both have some similar columns. Is there a way to fill the missing values in df1 based on matching ID's from DF2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the id's that were not present in both dataframes. Plus it required manually dropping duplicate columns and in my real dataset, there will be a large number of these, making doing so cumbersome. Even ignoring that though,
both the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternate solution I have thought of is using rbind
then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns so I would have to create intermediate objects of just the shared columns, rbind
, then drop incomplete cases, then join
with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
These dataframe represents the case where there are many columns that have incomplete data and a second dataframe that has all of the missing data. Ideally, we would not need to separately list each each column with wq1 := i.wq1
etc.
Upvotes: 2
Views: 2031
Reputation: 12703
If you want to join only by id
column, you can remove phase
in the on
clause of code below.
Also your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase values, and replace color values of df2 with color values of df1
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` ( wq2 = i.wq2,
wq3 = i.wq3,
wq4 = i.wq4,
wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
Upvotes: 3