D500
D500

Reputation: 442

Manipulating dataset to account for repeated measures

Given:

df <- data.frame(
                  CompanyID=c("Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers"
                            ,"Drinkers","Drinkers", "Liquders","Liquders","Liquders","PelletCoffeeCo","PelletCoffeeCo"),
                  Email= c("[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 
                          "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
                          "[email protected]","[email protected]","[email protected]","[email protected]",
                        "[email protected]"),
                  Day= c("1","2","3","4","5","6","7","8","9","10","1","2","3","1","2"),
                 var1= c(4,5,5,5,2,3,2,7,6,5,7,6,6,2,3))

I need to figure out how to get to:

df2 <- data.frame(CompanyID=c("Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers"
                            ,"Drinkers","Drinkers", "Liquders","Liquders","Liquders","Liquders","Liquders","Liquders",
                            "Liquders","Liquders","Liquders","Liquders", "PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo",
                            "PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo",
                            "PelletCoffeeCo","PelletCoffeeCo"),
                  Email= c("[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 
                             "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
                           "[email protected]","[email protected]","[email protected]","[email protected]","[email protected]",
                           "[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]",
                           "[email protected]","[email protected]","[email protected]","[email protected]",
                           "[email protected]","[email protected]","[email protected]","[email protected]",
                           "[email protected]"),
                  Day= c("1","2","3","4","5","6","7","8","9","10","1","2","3","4","5","6","7","8","9","10",
                         "1","2","3","4","5","6","7","8","9","10"),
                  var1= c(4,5,5,5,2,3,2,7,6,5,7,6,6, NA,NA,NA,NA,NA,NA,NA, 2,3,NA,NA,NA,NA,NA,NA,NA,NA))

Explanation: I have data where I surveyed people once a day over a course of 10 days. In a perfect world, I would have 10 responses from each participant, denoted by day1:day10. However, due to non-response, some participants gave 3 responses, others, 6, and others 10 and etc. I'm setting the data up to run a growth model, and so I need the column "Day" to always read Day1 - Day 10, regardless if there's data for those responses. I've tried to demonstrate this by adding NA's to rows that don't have all 10 days of data.

Upvotes: 0

Views: 60

Answers (2)

www
www

Reputation: 4224

Try this:

library(tidyr)

df %>% 
  complete(nesting(CompanyID,Email), Day = seq(min(Day), max(Day), 1L)) %>%
  data.frame()

Output:

        CompanyID                  Email Day var1
1        Drinkers        [email protected]   1    4
2        Drinkers        [email protected]   2    5
3        Drinkers        [email protected]   3    5
4        Drinkers        [email protected]   4    5
5        Drinkers        [email protected]   5    5
6        Drinkers        [email protected]   6    2
7        Drinkers        [email protected]   7    3
8        Drinkers        [email protected]   8    2
9        Drinkers        [email protected]   9    7
10       Drinkers        [email protected]  10    6
11       Liquders      [email protected]   1    7
12       Liquders      [email protected]   2   NA
13       Liquders      [email protected]   3    6
14       Liquders      [email protected]   4    6
15       Liquders      [email protected]   5   NA
16       Liquders      [email protected]   6   NA
17       Liquders      [email protected]   7   NA
18       Liquders      [email protected]   8   NA
19       Liquders      [email protected]   9   NA
20       Liquders      [email protected]  10   NA
21 PelletCoffeeCo [email protected]   1    2
22 PelletCoffeeCo [email protected]   2   NA
23 PelletCoffeeCo [email protected]   3    3
24 PelletCoffeeCo [email protected]   4   NA
25 PelletCoffeeCo [email protected]   5   NA
26 PelletCoffeeCo [email protected]   6   NA
27 PelletCoffeeCo [email protected]   7   NA
28 PelletCoffeeCo [email protected]   8   NA
29 PelletCoffeeCo [email protected]   9   NA
30 PelletCoffeeCo [email protected]  10   NA

Edit:

The code above fills each group's Day column values with a complete set of Day values defined by the minimum and maximum of the existing values in that column (i.e. 1 and 10, respectively). The groups by which these Day values are filled can be redefined as needed, but I chose to define them here as Company + Email with the line "nesting(CompanyID,Email)". The data.frame() line is just there to convert the output to a data.frame instead of a tibble. If a data.frame output isn't necessary, feel free to replace or remove that line.

Upvotes: 2

pyll
pyll

Reputation: 1764

First, create a data frame of the unique company ID. Next, create a data frame of the desired days.

Cross Join these together.

Then join to your original dataset to fill out the table.

comp <- data.frame(CompanyID = unique(df$CompanyID))
Day <- data.frame(Day = c("1","2","3","4","5","6","7","8","9","10"))

compDay <- merge(comp, Day, all = TRUE)

dfday <- merge(df, compDay, by = c("CompanyID", "Day"), all = TRUE)

Upvotes: 0

Related Questions