Diego
Diego

Reputation: 127

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 datframes are matched by a primary key (that is 'pid'). dfnew is a subset of dfold, so that all variables in dfnew are also on dfold but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes in dfmerge so to update common variables from dfnew --> dfold but at the same time retaining pre-existing variables in dfold. I have tried merge(), match(), dplyr, and sqldf packages, but either I obtain a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 rows) loop that is eliminating *.x variables after a merge by pid with all.x = TRUE option). Might you please advice on a more efficient way to obtain such result if available ?

Thank you in advance

P.S: To make things easier, I have created a minimal version of dfold and dfnew: dfnew has now 3 variables, no NAs, while dfold has 5 variables, NAs included. Here it is the dataframes structure

dfold:

structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France", 
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L, 
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L, 
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased", 
"pid"), row.names = c(NA, 5L), class = "data.frame")

dfnew:

structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000, 
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA, 
3L), class = "data.frame")

Although here the issue is limited to just 2 variables Please remind that the real scenario will involve 75 variables.

Upvotes: 1

Views: 851

Answers (1)

LAP
LAP

Reputation: 6685

Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40     NA       Yes   5

> dfnew
  Age Salary pid
1  44  72000   1
2  27  48000   2
3  30  54000   3
4  38  61000   4
5  40  70000   5

To do this for a single column, try

dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5

Using it on the whole dataset was a bit trickier:

First define all common colnames except pid:

cols <- names(dfnew)[names(dfnew) != "pid"]

> cols
[1] "Age"    "Salary"

Now use mapply to replace the NA values with ifelse:

dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])

> dfold
  Country Age Salary Purchased pid
1  France  44  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5

This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use

cols <- names(dfnew)[which(names(dfnew) %in% names(dfold))][names(dfnew) != "pid"]

Upvotes: 1

Related Questions