Reputation: 165
Suppose I have 2 data.frames that I want to merge. Data as follows:
options(scipen = 999)
LHS <- structure(
list(
v1 = c(1, 2, 3, 4, 5, 6),
v2 = c("one", "one", "one", "one", "one", "one"),
v3 = c("two", "two", "two", "two", "two", "two"),
v4 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_),
v5 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_)),
.Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character", "character","character","character"),
row.names = c(NA, -6L),
class = "data.frame"
) # end of LHS `structure`
RHS <- structure(
list(
v1 = c(3, 5, 6, 10, 100, 1000, 10000, 100000),
v2 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY"),
v3 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY")),
.Names = c("id", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character","character"),
row.names = c(NA, -8L),
class = "data.frame"
) # end of RHS `structure`
In LHS data.frame some rows with id 3, 5 and 6 have missing values. So I want to use left_join
to combine my LHS and RHS, however, I am interested only in rows with missing values in variables of interest. In other words, I want to implement left_join
to rows with missing values in particular columns (e.g. var of interest 1 and var of interest 2). Note, that RHS data.frames (I would have more than 1) do not contain data for rows (id) that already present in LHS data.frame.
The only one solution I figured out was to use lapply
over all rows and implement left_join
for those which contain missing values in particular columns (id 3, 5 and 6). lapply
could be upgraded to mclapply
to speed up execution time, however, for long data.frames such operation could be pricy.
Another one possible solution is to divide data.frames into two: with and without missing values, perform left_join and combine them together.
I am looking for more gentle solution. Any ides?
UPDATED desired output:
structure(
list(
v1 = c(1, 2, 3, 4, 5, 6),
v2 = c("one", "one", "one", "one", "one", "one"),
v3 = c("two", "two", "two", "two", "two", "two"),
v4 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY"),
v5 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY")),
.Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character", "character","character","character"),
row.names = c(NA, -6L),
class = "data.frame"
)
Upvotes: 3
Views: 570
Reputation: 26218
The following tidyverse
way may also help in doing as expected for any number of vars of interest
LHS %>% left_join(RHS, by = 'id') %>%
mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
select(!(ends_with('.y')))
id first var second var var of interest 1.x var of interest 2.x
1 1 one two xx xx
2 2 one two zz zz
3 3 one two QWERTY QWERTY
4 4 one two yy yy
5 5 one two QWERTY QWERTY
6 6 one two QWERTY QWERTY
If you want to rename back your cols, do this
LHS %>% left_join(RHS, by = 'id') %>%
mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
select(!(ends_with('.y'))) %>%
rename_with(~str_remove(., '.x'), ends_with('.x'))
id first var second var var of interest 1 var of interest 2
1 1 one two xx xx
2 2 one two zz zz
3 3 one two QWERTY QWERTY
4 4 one two yy yy
5 5 one two QWERTY QWERTY
6 6 one two QWERTY QWERTY
Upvotes: 3
Reputation: 388907
data.table
update join would be helpful here.
library(data.table)
setDT(LHS)
setDT(RHS)
cols <- c("var of interest 1", "var of interest 2")
LHS[RHS, on = .(id), (cols) := mget(paste0('i.', cols))]
LHS
# id first var second var var of interest 1 var of interest 2
#1: 1 one two xx xx
#2: 2 one two zz zz
#3: 3 one two QWERTY QWERTY
#4: 4 one two yy yy
#5: 5 one two QWERTY QWERTY
#6: 6 one two QWERTY QWERTY
Upvotes: 4