Reputation: 8567
I have two data frames containing some survey data with two columns of ID. One data frame has data on one year and the other on another year. One of the data frame has a variable that the other has not. Here's an example of these two data frames:
df1 <- data.frame(id1 = c(1, 1, 2, 2, 3, 3),
id2 = c(1, 2, 1, 2, 1, 2),
name = c("foo1", "foo2", "foo3", "foo4", "foo5", "foo6"),
variable = c(100, 110, 120, 130, 140, 150),
year = rep(2019, 6))
df2 <- data.frame(id1 = c(1, 1, 2, 2, 3, 3),
id2 = c(1, 2, 1, 2, 1, 2),
variable = c(200, 210, 220, 230, 240, 250),
year = rep(2020, 6))
I would like to have a unique df that contains two observations per person: the observation for 2019 and the one for 2020. Here's what I tried:
total <- bind_rows(df1, df2)
total <- total[order(total$id1, total$id2, total$year), ]
total <- total[c(1, 2, 3, 5, 4)]
and the df I obtain:
id1 id2 name year variable
1 1 1 foo1 2019 100
7 1 1 <NA> 2020 200
2 1 2 foo2 2019 110
8 1 2 <NA> 2020 210
3 2 1 foo3 2019 120
9 2 1 <NA> 2020 220
4 2 2 foo4 2019 130
10 2 2 <NA> 2020 230
5 3 1 foo5 2019 140
11 3 1 <NA> 2020 240
6 3 2 foo6 2019 150
12 3 2 <NA> 2020 250
There are two issues here: the first one is that there's a NA
in 50% of the cells of name
whereas I would like the names to be duplicated; the second issue is that it takes 3 steps (more if I want to duplicate the names).
Is there a function that automatically merges the two df the way I want? I would prefer a function included in the tidyr
package, but any other solution is accepted.
I know that there are many questions about merging two data frames but so far, I didn't find what I want and I can't find a tidyr
function for that.
Edit: expected output:
id1 id2 name year variable
1 1 1 foo1 2019 100
7 1 1 foo1 2020 200
2 1 2 foo2 2019 110
8 1 2 foo2 2020 210
3 2 1 foo3 2019 120
9 2 1 foo3 2020 220
4 2 2 foo4 2019 130
10 2 2 foo4 2020 230
5 3 1 foo5 2019 140
11 3 1 foo5 2020 240
6 3 2 foo6 2019 150
12 3 2 foo6 2020 250
Upvotes: 3
Views: 865
Reputation: 13135
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>% group_by(id1, id2) %>%
fill(name, .direction = 'down') %>%
arrange(id1, id2, year)
id1 id2 name variable year
1 1 1 foo1 100 2019
2 1 1 foo1 200 2020
3 1 2 foo2 110 2019
4 1 2 foo2 210 2020
5 2 1 foo3 120 2019
6 2 1 foo3 220 2020
7 2 2 foo4 130 2019
8 2 2 foo4 230 2020
9 3 1 foo5 140 2019
10 3 1 foo5 240 2020
11 3 2 foo6 150 2019
12 3 2 foo6 250 2020
To fill the missing columns without specifying the column names, we can do
bind_rows(df1, df2) %>% group_by(id1, id2) %>%
arrange(id1, id2, year) %>% ungroup() %>%
fill(select_if(.,~any(is.na(.))) %>% names() ,.direction = 'down')
#tidyr v1.0.0
full_join(df1, df2, by=c('id1', 'id2'), suffix=c('.df1', '.df2')) %>%
pivot_longer(cols=matches('.df[12]$'), names_to = c(".value","Ind"), names_sep = "\\.")
Upvotes: 2
Reputation: 6073
Sounds like you just want to rbind()
the two data.frames, but with the extra step of adding the "name" column to df2
first.
Step 1: add "name" to df2
df2 <- merge(df2, df1[,c("id1", "id2", "name")],
by=c("id1", "id2"), all.x=TRUE)
Step 2: append the two data.frames
# base R approach
result <- rbind(df1, df2)
# tidyverse approach
result <- bind_rows(df1, df2)
[optional] Step 3: sort
# base R approach
result[order(result$name, result$year), ]
# tidyverse approach
result %>% arrange(name, year)
Upvotes: 2