Jeparov
Jeparov

Reputation: 165

Apply left_join for specific rows

Suppose I have 2 data.frames that I want to merge. Data as follows:

options(scipen = 999)

LHS <- structure(
   
   list(
      v1 = c(1, 2, 3, 4, 5, 6),
      v2 = c("one", "one", "one", "one", "one", "one"),
      v3 = c("two", "two", "two", "two", "two", "two"),
      v4 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_),
      v5 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_)),
   
   .Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
   .typeOf = c("numeric", "character", "character","character","character"),
   row.names = c(NA, -6L),
   class = "data.frame"
   
   ) # end of LHS `structure`
   
   
RHS <- structure(
   
   list(
      v1 = c(3, 5, 6, 10, 100, 1000, 10000, 100000),
      v2 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY"),
      v3 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY")),
   
   .Names = c("id", "var of interest 1", "var of interest 2"),
   .typeOf = c("numeric", "character","character"),
   row.names = c(NA, -8L),
   class = "data.frame"
   
) # end of RHS `structure`

In LHS data.frame some rows with id 3, 5 and 6 have missing values. So I want to use left_join to combine my LHS and RHS, however, I am interested only in rows with missing values in variables of interest. In other words, I want to implement left_join to rows with missing values in particular columns (e.g. var of interest 1 and var of interest 2). Note, that RHS data.frames (I would have more than 1) do not contain data for rows (id) that already present in LHS data.frame.

The only one solution I figured out was to use lapply over all rows and implement left_join for those which contain missing values in particular columns (id 3, 5 and 6). lapply could be upgraded to mclapply to speed up execution time, however, for long data.frames such operation could be pricy.

Another one possible solution is to divide data.frames into two: with and without missing values, perform left_join and combine them together.

I am looking for more gentle solution. Any ides?

UPDATED desired output:

structure(

   list(
      v1 = c(1, 2, 3, 4, 5, 6),
      v2 = c("one", "one", "one", "one", "one", "one"),
      v3 = c("two", "two", "two", "two", "two", "two"),
      v4 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY"),
      v5 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY")),

   .Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
   .typeOf = c("numeric", "character", "character","character","character"),
   row.names = c(NA, -6L),
   class = "data.frame"

   )

Upvotes: 3

Views: 570

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26218

The following tidyverse way may also help in doing as expected for any number of vars of interest

LHS %>% left_join(RHS, by = 'id') %>%
  mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
  select(!(ends_with('.y')))

  id first var second var var of interest 1.x var of interest 2.x
1  1       one        two                  xx                  xx
2  2       one        two                  zz                  zz
3  3       one        two              QWERTY              QWERTY
4  4       one        two                  yy                  yy
5  5       one        two              QWERTY              QWERTY
6  6       one        two              QWERTY              QWERTY

If you want to rename back your cols, do this

LHS %>% left_join(RHS, by = 'id') %>%
  mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
  select(!(ends_with('.y'))) %>%
  rename_with(~str_remove(., '.x'), ends_with('.x'))

  id first var second var var of interest 1 var of interest 2
1  1       one        two                xx                xx
2  2       one        two                zz                zz
3  3       one        two            QWERTY            QWERTY
4  4       one        two                yy                yy
5  5       one        two            QWERTY            QWERTY
6  6       one        two            QWERTY            QWERTY

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388907

data.table update join would be helpful here.

library(data.table)

setDT(LHS)
setDT(RHS)

cols <- c("var of interest 1", "var of interest 2")
LHS[RHS, on = .(id), (cols) := mget(paste0('i.', cols))]
LHS

#   id first var second var var of interest 1 var of interest 2
#1:  1       one        two                xx                xx
#2:  2       one        two                zz                zz
#3:  3       one        two            QWERTY            QWERTY
#4:  4       one        two                yy                yy
#5:  5       one        two            QWERTY            QWERTY
#6:  6       one        two            QWERTY            QWERTY

Upvotes: 4

Related Questions