Reputation: 2843
I need to add missing combinations of variables within groups, but using combinations from outside of that group. Take the following example. Here is a data frame where raters look at samples, and then answer three questions about the sample. Rater A looked a samples 1, 2, and 3, whereas Rater B looked at only samples 1 and 2. Notice also, that Rater B only completed item 7060 and 7930 for sample 1, but did not complete item 7842. I want to turn that implicit missing item scoring into an explicit missing. The goal is to add rows for items that are missing within a sample, but not add samples that are missing within raters (i.e., don't add sample 3 to Rater B, but do add missing items to samples that were rated by Rater B).
library(tidyverse)
df <- data_frame(
rater_id = c(rep("A", 9), rep("B", 5)),
sample_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(1, 2), rep(2, 3)),
item_id = c(7060, 7842, 7930, 9571, 4678, 5966, 1758, 3148, 2574,
7060, 7930, 9571, 4678, 5966),
score = sample(c(0, 1), size = 14, replace = TRUE)
)
df
#> # A tibble: 14 x 4
#> rater_id sample_id item_id score
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 7060 0
#> 2 A 1 7842 0
#> 3 A 1 7930 1
#> 4 A 2 9571 0
#> 5 A 2 4678 0
#> 6 A 2 5966 1
#> 7 A 3 1758 0
#> 8 A 3 3148 1
#> 9 A 3 2574 0
#> 10 B 1 7060 0
#> 11 B 1 7930 0
#> 12 B 2 9571 0
#> 13 B 2 4678 1
#> 14 B 2 5966 1
I have tried using tidy::complete
with little success. Nesting by sample_id
and item_id
doesn't add the missing row, because all combinations of sample_id
and item_id
are already present in the data.
df %>% complete(nesting(sample_id, item_id))
#> # A tibble: 14 x 4
#> sample_id item_id rater_id score
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 7060 A 0
#> 2 1 7060 B 0
#> 3 1 7842 A 0
#> 4 1 7930 A 1
#> 5 1 7930 B 0
#> 6 2 4678 A 0
#> 7 2 4678 B 1
#> 8 2 5966 A 1
#> 9 2 5966 B 1
#> 10 2 9571 A 0
#> 11 2 9571 B 0
#> 12 3 1758 A 0
#> 13 3 2574 A 0
#> 14 3 3148 A 1
Adding rater_id
outside of the nesting statement also fails to give the desired output. It successfully adds the missing row for sample 1, but also adds additional missing rows for sample 3. However, because Rater B was never given sample 3, we wouldn't count this as "missing" (even though technically it kind of is).
df %>% complete(rater_id, nesting(sample_id, item_id))
#> # A tibble: 18 x 4
#> rater_id sample_id item_id score
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 7060 0
#> 2 A 1 7842 0
#> 3 A 1 7930 1
#> 4 A 2 4678 0
#> 5 A 2 5966 1
#> 6 A 2 9571 0
#> 7 A 3 1758 0
#> 8 A 3 2574 0
#> 9 A 3 3148 1
#> 10 B 1 7060 0
#> 11 B 1 7842 NA
#> 12 B 1 7930 0
#> 13 B 2 4678 1
#> 14 B 2 5966 1
#> 15 B 2 9571 0
#> 16 B 3 1758 NA
#> 17 B 3 2574 NA
#> 18 B 3 3148 NA
My ideal output would look like this:
#> # A tibble: 18 x 4
#> rater_id sample_id item_id score
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 7060 0
#> 2 A 1 7842 0
#> 3 A 1 7930 1
#> 4 A 2 4678 0
#> 5 A 2 5966 1
#> 6 A 2 9571 0
#> 7 A 3 1758 0
#> 8 A 3 2574 0
#> 9 A 3 3148 1
#> 10 B 1 7060 0
#> 11 B 1 7842 NA
#> 12 B 1 7930 0
#> 13 B 2 4678 1
#> 14 B 2 5966 1
#> 15 B 2 9571 0
Upvotes: 0
Views: 69
Reputation: 376
This is a bit ugly, but you could use your last attempt (adding rater_id outside the nesting statement) and then remove cases where rater-sample pairs don't exist at all. Probably several ways to do this, but this one seems to work:
df %>%
complete(rater_id, nesting(sample_id, item_id)) %>% # This was your approach
group_by(rater_id, sample_id) %>%
mutate(count_non_nas = sum(!is.na(score))) %>% # Count up the number of non-NA scores
filter(count_non_nas > 0) %>% # Remove cases where ALL the scores in a group are NA
select(-count_non_nas) %>% # Drop the intermediate column
ungroup()
df
# A tibble: 15 x 4
rater_id sample_id item_id score
<chr> <dbl> <dbl> <dbl>
1 A 1 7060 1
2 A 1 7842 0
3 A 1 7930 0
4 A 2 4678 0
5 A 2 5966 0
6 A 2 9571 0
7 A 3 1758 0
8 A 3 2574 0
9 A 3 3148 0
10 B 1 7060 1
11 B 1 7842 NA
12 B 1 7930 1
13 B 2 4678 0
14 B 2 5966 1
15 B 2 9571 0
Upvotes: 1