spore234
spore234

Reputation: 3640

merge data frame with repeated observations

I have two data frames:

df1 <- data.frame(index = c(rep(2,5), rep(3,8), rep(4,6), rep(5,9)),
              start = c(0:4, 0:7, 0:5, 0:8),
              end = c(1:5, 1:8, 1:6, 1:9),
              v1= rep(1,28))


df2 <- data.frame(index = c(rep(2,5), rep(3,7), rep(4,6), rep(5,6)),
              v1 = rnorm(24))

Assume the index column represents the different observations of a person. In this example in df1 we have a person with id 2 and 5 observations, a person with id 3 has 8 observations and so on.

I now want to replace the values of v1 in df1 by the values of v1 in df2. df2 has sometimes fewer observations than df1, e.g. for persons 3 and 5.

Merging should be in a way that the v1's of df2 are taken for the first observations and if no more observations are available then NAs should be replaced.

The structure of the df1 frame should be preserved, except that the v1's should be replaced with that from df2.

Upvotes: 1

Views: 397

Answers (1)

aosmith
aosmith

Reputation: 36076

To do this via a join, you'd need a within-group index so you have an unique identifier to match on.

This is straightforward to create if df2 is already in the correct order and so any NA values after joining should be at the end of each group. You can add an end variable to df2 and the join on index and end.

I also remove v1 from the first dataset, df1.

library(dplyr)
df2 %>%
    group_by(index) %>%
    mutate(end = 1:n()) %>%
    left_join(select(df1, -v1), .)

   index start end          v1
1      2     0   1 -1.57254316
2      2     1   2 -1.09000063
3      2     2   3 -0.13775401
4      2     3   4 -0.94088741
5      2     4   5  0.45180766
6      3     0   1  2.14925746
7      3     1   2  1.79103360
8      3     2   3 -0.28344963
9      3     3   4 -0.10250375
10     3     4   5  0.21321949
11     3     5   6  0.07982287
12     3     6   7  0.37987963
13     3     7   8          NA
14     4     0   1  0.41360521
15     4     1   2 -0.28888575
16     4     2   3 -0.58786830
17     4     3   4  0.14340062
18     4     4   5  1.46434162
19     4     5   6  1.73668914
20     5     0   1  1.43112233
21     5     1   2  0.40772442
22     5     2   3 -1.77085141
23     5     3   4 -0.94653948
24     5     4   5 -1.54412791
25     5     5   6 -0.28174510
26     5     6   7          NA
27     5     7   8          NA
28     5     8   9          NA

As pointed out in the comments by @SteveBeaupre, you could replace my old habit of 1:n() with row_number().

Upvotes: 2

Related Questions