Jake Thompson
Jake Thompson

Reputation: 2843

Completing a data frame within groups using combinations from out of group

I need to add missing combinations of variables within groups, but using combinations from outside of that group. Take the following example. Here is a data frame where raters look at samples, and then answer three questions about the sample. Rater A looked a samples 1, 2, and 3, whereas Rater B looked at only samples 1 and 2. Notice also, that Rater B only completed item 7060 and 7930 for sample 1, but did not complete item 7842. I want to turn that implicit missing item scoring into an explicit missing. The goal is to add rows for items that are missing within a sample, but not add samples that are missing within raters (i.e., don't add sample 3 to Rater B, but do add missing items to samples that were rated by Rater B).

library(tidyverse)

df <- data_frame(
  rater_id = c(rep("A", 9), rep("B", 5)),
  sample_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(1, 2), rep(2, 3)),
  item_id = c(7060, 7842, 7930, 9571, 4678, 5966, 1758, 3148, 2574,
              7060, 7930, 9571, 4678, 5966),
  score = sample(c(0, 1), size = 14, replace = TRUE)
)
df
#> # A tibble: 14 x 4
#>    rater_id sample_id item_id score
#>    <chr>        <dbl>   <dbl> <dbl>
#>  1 A                1    7060     0
#>  2 A                1    7842     0
#>  3 A                1    7930     1
#>  4 A                2    9571     0
#>  5 A                2    4678     0
#>  6 A                2    5966     1
#>  7 A                3    1758     0
#>  8 A                3    3148     1
#>  9 A                3    2574     0
#> 10 B                1    7060     0
#> 11 B                1    7930     0
#> 12 B                2    9571     0
#> 13 B                2    4678     1
#> 14 B                2    5966     1

I have tried using tidy::complete with little success. Nesting by sample_id and item_id doesn't add the missing row, because all combinations of sample_id and item_id are already present in the data.

df %>% complete(nesting(sample_id, item_id))
#> # A tibble: 14 x 4
#>    sample_id item_id rater_id score
#>        <dbl>   <dbl> <chr>    <dbl>
#>  1         1    7060 A            0
#>  2         1    7060 B            0
#>  3         1    7842 A            0
#>  4         1    7930 A            1
#>  5         1    7930 B            0
#>  6         2    4678 A            0
#>  7         2    4678 B            1
#>  8         2    5966 A            1
#>  9         2    5966 B            1
#> 10         2    9571 A            0
#> 11         2    9571 B            0
#> 12         3    1758 A            0
#> 13         3    2574 A            0
#> 14         3    3148 A            1

Adding rater_id outside of the nesting statement also fails to give the desired output. It successfully adds the missing row for sample 1, but also adds additional missing rows for sample 3. However, because Rater B was never given sample 3, we wouldn't count this as "missing" (even though technically it kind of is).

df %>% complete(rater_id, nesting(sample_id, item_id))
#> # A tibble: 18 x 4
#>    rater_id sample_id item_id score
#>    <chr>        <dbl>   <dbl> <dbl>
#>  1 A                1    7060     0
#>  2 A                1    7842     0
#>  3 A                1    7930     1
#>  4 A                2    4678     0
#>  5 A                2    5966     1
#>  6 A                2    9571     0
#>  7 A                3    1758     0
#>  8 A                3    2574     0
#>  9 A                3    3148     1
#> 10 B                1    7060     0
#> 11 B                1    7842    NA
#> 12 B                1    7930     0
#> 13 B                2    4678     1
#> 14 B                2    5966     1
#> 15 B                2    9571     0
#> 16 B                3    1758    NA
#> 17 B                3    2574    NA
#> 18 B                3    3148    NA

My ideal output would look like this:

#> # A tibble: 18 x 4
#>    rater_id sample_id item_id score
#>    <chr>        <dbl>   <dbl> <dbl>
#>  1 A                1    7060     0
#>  2 A                1    7842     0
#>  3 A                1    7930     1
#>  4 A                2    4678     0
#>  5 A                2    5966     1
#>  6 A                2    9571     0
#>  7 A                3    1758     0
#>  8 A                3    2574     0
#>  9 A                3    3148     1
#> 10 B                1    7060     0
#> 11 B                1    7842    NA
#> 12 B                1    7930     0
#> 13 B                2    4678     1
#> 14 B                2    5966     1
#> 15 B                2    9571     0

Upvotes: 0

Views: 69

Answers (1)

benc
benc

Reputation: 376

This is a bit ugly, but you could use your last attempt (adding rater_id outside the nesting statement) and then remove cases where rater-sample pairs don't exist at all. Probably several ways to do this, but this one seems to work:

df %>% 
  complete(rater_id, nesting(sample_id, item_id)) %>% # This was your approach
  group_by(rater_id, sample_id) %>% 
  mutate(count_non_nas = sum(!is.na(score))) %>% # Count up the number of non-NA scores
  filter(count_non_nas > 0) %>% # Remove cases where ALL the scores in a group are NA
  select(-count_non_nas) %>% # Drop the intermediate column
  ungroup()

df

# A tibble: 15 x 4
   rater_id sample_id item_id score
   <chr>        <dbl>   <dbl> <dbl>
 1 A                1    7060     1
 2 A                1    7842     0
 3 A                1    7930     0
 4 A                2    4678     0
 5 A                2    5966     0
 6 A                2    9571     0
 7 A                3    1758     0
 8 A                3    2574     0
 9 A                3    3148     0
10 B                1    7060     1
11 B                1    7842    NA
12 B                1    7930     1
13 B                2    4678     0
14 B                2    5966     1
15 B                2    9571     0

Upvotes: 1

Related Questions