tjebo
tjebo

Reputation: 23797

R Merge only parts of one column into an existent column from another dataframe

I am aware that merging is a widely covered topic. If you think this is a duplicate, I am very happy to be put onto the question that answers my question, but I haven't found it (Sorry!). Thanks

I have two data frames:

require(dplyr)
set.seed(1)
large_df <- data_frame(id = rep(paste0('id',1:40), each = 3),
                           age = c(rep(NA,60),rep (sample(20), each = 3)),
                           col3 = rep(letters[1:20],6), col4 = rep(1:60,2))
small_df <- data_frame(id = paste0('id',1:20),
                         age = sample(20))

large_df contains incomplete data (large_df$age), which is contained in small_df. Now I would like to bring the information from small_df$age into large_df$age (merged by the correct 'id'). I think this must be possible via merge or one of the join functions from dplyr, but several combinations did not bring the result I would like.

I also tried a for loop over the rows:

for(i in nrow(large_df)) {
  if (large_df[i,'id'] %in% small_df$id == TRUE) {
    large_df[i,'age'] <- small_df$age[which(small_df$id %in% large_df[i,'id'])]
  }
}

But this doesnt help, it doesn't even return any result. (Anyone an idea why not?)

My result would look like that:

large_df$age[1:60] <- rep(small_df$age, each = 3)
large_df
# A tibble: 120 x 4
   id      age col3   col4
   <chr> <int> <chr> <int>
 1 id1       6 a         1
 2 id1       6 b         2
 3 id1       6 c         3
 4 id2       8 d         4
 5 id2       8 e         5
 6 id2       8 f         6
 7 id3      11 g         7
 8 id3      11 h         8
 9 id3      11 i         9
10 id4      16 j        10
# ... with 110 more rows

Upvotes: 1

Views: 357

Answers (1)

Andrew Chisholm
Andrew Chisholm

Reputation: 6567

Using your data frames this would do the trick.

result = 
  large_df %>% 
  left_join(small_df, by = 'id') %>% 
  mutate(age = ifelse(is.na(age.x), age.y, age.x)) %>%
  dplyr::select(-age.x, -age.y)
result
# A tibble: 120 x 4
      id  col3  col4   age
   <chr> <chr> <int> <int>
 1   id1     a     1    19
 2   id1     b     2    19
 3   id1     c     3    19
 4   id2     d     4     5

If both age.x and age.y are NA then NA would be output in age.

Upvotes: 3

Related Questions