Fill missing rows in panel dataset based on range

Question

I want to fill my Panel dataset with rows based on the existing range of the grouping variables.

To create a better understanding I will use an example dataset:

> df<-data.frame(Student=c(1, 1, 2), Year=c(1,2,2), Type=c("Test","Exam","Test"),Points=c(80,140,30))
> df
  Student Year Type Points
1       1    1 Test     80
2       1    2 Exam    140
3       2    2 Test     30

What I want to have is for every Student for every year two obervations based on the range of points. It should look like this after the transformation:

> df2<-data.frame(Student=c(1, 1, 1,1,2,2,2,2), Year=c(1,1,2,2,1,1,2,2), PointRange=c("0_100","100_200","0_100","100_200","0_100","100_200","0_100","100_200"), n_tests=c(1,0,0,0,0,0,1,0), n_exams=c(0,0,0,1,0,0,0,0))
> df2
  Student Year PointRange n_tests n_exams
1       1    1      0_100       1       0
2       1    1    100_200       0       0
3       1    2      0_100       0       0
4       1    2    100_200       0       1
5       2    1      0_100       0       0
6       2    1    100_200       0       0
7       2    2      0_100       1       0
8       2    2    100_200       0       0

I've tried the following with dplyr-package:

df %>% mutate(PointRange = case_when(Points >= 0 & Points <= 100 ~ 1, Points >= 101 & Points <= 200 ~ 2)) %>%
+         group_by(Student, Year, PointRange) %>%
+   summarise(n_tests = sum(Type == "Test"),
+             n_exams = sum(Type=="Exam"))
# A tibble: 3 x 5
# Groups:   Student, Year [?]
  Student  Year PointRange n_tests n_exams
                 
1       1     1          1       1       0
2       1     2          2       0       1
3       2     2          1       1       0

What's missing then are five rows so that for every user I have two Point-Ranges for every year. How can I fix that?

Shree · Accepted Answer

You can use cut to create ranges and then dplyr::complete() to create all combinations of student, year, and range -

result <- df %>%
  mutate(PointRange = cut(Points, breaks = c(0, 100, 200), right=  F)) %>%
  complete(Student, Year, PointRange) %>%
  group_by(Student, Year, PointRange) %>%
  summarize(
    n_tests = sum(Type == "Test", na.rm = T),
    n_exams = sum(Type == "Exam", na.rm = T)
  )

# A tibble: 8 x 5
# Groups:   Student, Year [?]
  Student  Year PointRange n_tests n_exams
                 
1    1.00  1.00 [0,100)          1       0
2    1.00  1.00 [100,200)        0       0
3    1.00  2.00 [0,100)          0       0
4    1.00  2.00 [100,200)        0       1
5    2.00  1.00 [0,100)          0       0
6    2.00  1.00 [100,200)        0       0
7    2.00  2.00 [0,100)          1       0
8    2.00  2.00 [100,200)        0       0

Fill missing rows in panel dataset based on range

Answers (1)

Related Questions