bretauv
bretauv

Reputation: 8567

R: how can I merge two data frames so that I obtain panel data?

I have two data frames containing some survey data with two columns of ID. One data frame has data on one year and the other on another year. One of the data frame has a variable that the other has not. Here's an example of these two data frames:

df1 <- data.frame(id1 = c(1, 1, 2, 2, 3, 3), 
                  id2 = c(1, 2, 1, 2, 1, 2),
                  name = c("foo1", "foo2", "foo3", "foo4", "foo5", "foo6"),
                  variable = c(100, 110, 120, 130, 140, 150),
                  year = rep(2019, 6))

df2 <- data.frame(id1 = c(1, 1, 2, 2, 3, 3), 
                  id2 = c(1, 2, 1, 2, 1, 2),
                  variable = c(200, 210, 220, 230, 240, 250),
                  year = rep(2020, 6))

I would like to have a unique df that contains two observations per person: the observation for 2019 and the one for 2020. Here's what I tried:

total <- bind_rows(df1, df2)
total <- total[order(total$id1, total$id2, total$year), ]
total <- total[c(1, 2, 3, 5, 4)]

and the df I obtain:

 id1 id2 name year variable
1    1   1 foo1 2019      100
7    1   1 <NA> 2020      200
2    1   2 foo2 2019      110
8    1   2 <NA> 2020      210
3    2   1 foo3 2019      120
9    2   1 <NA> 2020      220
4    2   2 foo4 2019      130
10   2   2 <NA> 2020      230
5    3   1 foo5 2019      140
11   3   1 <NA> 2020      240
6    3   2 foo6 2019      150
12   3   2 <NA> 2020      250

There are two issues here: the first one is that there's a NA in 50% of the cells of name whereas I would like the names to be duplicated; the second issue is that it takes 3 steps (more if I want to duplicate the names).

Is there a function that automatically merges the two df the way I want? I would prefer a function included in the tidyr package, but any other solution is accepted.

I know that there are many questions about merging two data frames but so far, I didn't find what I want and I can't find a tidyr function for that.

Edit: expected output:

 id1 id2 name year variable
1    1   1 foo1 2019      100
7    1   1 foo1 2020      200
2    1   2 foo2 2019      110
8    1   2 foo2 2020      210
3    2   1 foo3 2019      120
9    2   1 foo3 2020      220
4    2   2 foo4 2019      130
10   2   2 foo4 2020      230
5    3   1 foo5 2019      140
11   3   1 foo5 2020      240
6    3   2 foo6 2019      150
12   3   2 foo6 2020      250

Upvotes: 3

Views: 865

Answers (2)

A. Suliman
A. Suliman

Reputation: 13135

library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>% group_by(id1, id2) %>% 
    fill(name, .direction = 'down') %>% 
    arrange(id1, id2, year)

   id1 id2 name variable year
1    1   1 foo1      100 2019
2    1   1 foo1      200 2020
3    1   2 foo2      110 2019
4    1   2 foo2      210 2020
5    2   1 foo3      120 2019
6    2   1 foo3      220 2020
7    2   2 foo4      130 2019
8    2   2 foo4      230 2020
9    3   1 foo5      140 2019
10   3   1 foo5      240 2020
11   3   2 foo6      150 2019
12   3   2 foo6      250 2020

To fill the missing columns without specifying the column names, we can do

bind_rows(df1, df2) %>% group_by(id1, id2) %>% 
    arrange(id1, id2, year) %>% ungroup() %>%
    fill(select_if(.,~any(is.na(.))) %>% names() ,.direction = 'down')

Robust solution:

#tidyr v1.0.0
full_join(df1, df2, by=c('id1', 'id2'), suffix=c('.df1', '.df2')) %>% 
    pivot_longer(cols=matches('.df[12]$'), names_to = c(".value","Ind"), names_sep = "\\.")

Upvotes: 2

DanY
DanY

Reputation: 6073

Sounds like you just want to rbind() the two data.frames, but with the extra step of adding the "name" column to df2 first.

Step 1: add "name" to df2

df2 <- merge(df2, df1[,c("id1", "id2", "name")], 
             by=c("id1", "id2"), all.x=TRUE)

Step 2: append the two data.frames

# base R approach
result <- rbind(df1, df2)

# tidyverse approach
result <- bind_rows(df1, df2)

[optional] Step 3: sort

# base R approach 
result[order(result$name, result$year), ]

# tidyverse approach
result %>%  arrange(name, year)

Upvotes: 2

Related Questions