Cmagelssen
Cmagelssen

Reputation: 660

Calculate a complex difference score with tidyverse?

I have a large dataset of 70 000 rows that I want to perform some operations on, but I can't find an appropriate solution.

     bib sta run course finish comment day
1    42   9   1     SG  19.88      99   1
2    42  17   2      A  19.96      11   1
3    42  27   3      B  20.92      22   1
4    42  39   4      A  19.60      11   1
5    42  48   5     SG  20.24      99   1
6    42  61   6      C  22.90      33   1
7    42  76   7      B  20.70      22   1
8    42  86   8      C  22.74      33   1
9    42  93   9      C  22.75      33   1
10   42 103  10      A  19.79      11   1
11   42 114  11      B  20.67      22   1
12   42 120  12     SG  20.10      99   1

I want to end up with a tibble that:

  1. calculates the mean finish time in SG course for each bib number on one particular day. For example, 19.88 + 20.24 + 20.10 / 3
  2. calculate a difference score for each observation in the dataset by subtracting finish from this mean SG score. For example, 19.88 - mean(SG), 19.96 - mean(SG).

I have tried the following approach:

  1. First group by day, bib and course. Then filter by SG and calculate the mean:
avg.sgtime <- df %>%
  group_by(day, bib, course) %>%
  filter(course == 'SG') %>%
  mutate(avg.sg = mean(finish))

Resulting in the following tibble

 bib   sta   run course finish comment day   avg.sg
   <int> <int> <int> <chr>   <dbl>   <int> <chr>  <dbl>
 1    42     9     1 SG       19.9      99 1       20.1
 2    42    48     5 SG       20.2      99 1       20.1
 3    42   120    12 SG       20.1      99 1       20.1
 4    42     6     1 SG       20.0      99 2       19.9
 5    42    42     5 SG       19.8      77 2       19.9
 6    42   130    15 SG       19.9      99 2       19.9
 7    42     6     1 SG       20.6      99 3       20.5
 8    42    68    12 SG       20.6      77 3       20.5
 9    42    90    15 SG       20.4      77 3       20.5


Finally I join the two tibbles together using the following syntax:


df %>% full_join(avg.sgtime) %>%
  mutate(diff = finish - avg.sg)

However, this doesn't work. It only works for the SG course but not for course A, B and C. Is there a way to fix this or is there a better solution to the problem?

bib sta run course finish comment day   avg.sg         diff
1    42   9   1     SG  19.88      99   1 20.07333 -0.193333333
2    42  17   2      A  19.96      11   1       NA           NA
3    42  27   3      B  20.92      22   1       NA           NA
4    42  39   4      A  19.60      11   1       NA           NA
5    42  48   5     SG  20.24      99   1 20.07333  0.166666667

Upvotes: 2

Views: 301

Answers (3)

Anoushiravan R
Anoushiravan R

Reputation: 21928

Thanks @Marcelo Avila for providing me with a very good hint:

I hope this is what you are looking for:

library(dplyr)

df %>%
  group_by(bib, day) %>%
  mutate(across(finish, ~ mean(.x[course == "SG"]), .names = "avg_{.col}"),
         diff = finish - avg_finish, 
         avg_finish = ifelse(course == "SG", avg_finish, NA))


# A tibble: 12 x 9
# Groups:   bib, day [1]
     bib   sta   run course finish comment   day avg_finish    diff
   <int> <int> <int> <chr>   <dbl>   <int> <int>      <dbl>   <dbl>
 1    42     9     1 SG       19.9      99     1       20.1 -0.193 
 2    42    17     2 A        20.0      11     1       NA   -0.113 
 3    42    27     3 B        20.9      22     1       NA    0.847 
 4    42    39     4 A        19.6      11     1       NA   -0.473 
 5    42    48     5 SG       20.2      99     1       20.1  0.167 
 6    42    61     6 C        22.9      33     1       NA    2.83  
 7    42    76     7 B        20.7      22     1       NA    0.627 
 8    42    86     8 C        22.7      33     1       NA    2.67  
 9    42    93     9 C        22.8      33     1       NA    2.68  
10    42   103    10 A        19.8      11     1       NA   -0.283 
11    42   114    11 B        20.7      22     1       NA    0.597 
12    42   120    12 SG       20.1      99     1       20.1  0.0267

I also added another alternative solution with a minor change, using dear @Marcelo Avila's data set:

df %>%
  group_by(bib, day) %>%
  mutate(across(finish, ~ mean(.x[select(cur_data(), course) == "SG"]), .names = "avg_{.col}"),
         diff = finish - avg_finish, 
         avg_finish = ifelse(course == "SG", avg_finish, NA))

# A tibble: 36 x 9
# Groups:   bib, day [3]
     bib   sta   run course finish comment   day avg_finish   diff
   <dbl> <dbl> <dbl> <chr>   <dbl>   <dbl> <dbl>      <dbl>  <dbl>
 1    42     9     1 SG       19.9      99     1       20.1 -0.193
 2    42    17     2 A        20.0      11     1       NA   -0.113
 3    42    27     3 B        20.9      22     1       NA    0.847
 4    42    39     4 A        19.6      11     1       NA   -0.473
 5    42    48     5 SG       20.2      99     1       20.1  0.167
 6    42    61     6 C        22.9      33     1       NA    2.83 
 7    42    76     7 B        20.7      22     1       NA    0.627
 8    42    86     8 C        22.7      33     1       NA    2.67 
 9    42    93     9 C        22.8      33     1       NA    2.68 
10    42   103    10 A        19.8      11     1       NA   -0.283
# ... with 26 more rows

Upvotes: 2

Michael Barrowman
Michael Barrowman

Reputation: 1191

You can filter your values for finish within the mutate() and calculate the mean based on those:

df %>%
  group_by(day,bib) %>%
  mutate(
    avg.sg = mean(finish[course=="SG"]),
    diff = finish - avg.sg)

Upvotes: 2

Marcelo Avila
Marcelo Avila

Reputation: 2374

Is the following what you are aiming for?

(note that I added a few random values for a second bib just to make sure the join is done properly)

The difference to your attempt is using summarise() instead of mutate() to consolidate the avg.sgtime data frame, and also dropping a few columns so that the join is not populated with NAs. Instead of dropping you can also set the relevant columns to join by passing the by argument to the left_join() function.

library(dplyr)
library(tidyr) # for join

avg.sgtime <- df %>%
  group_by(day, bib, course) %>%
  filter(course == 'SG') %>%
  summarise(avg.sg = mean(finish), .groups = "drop") %>%
  select(c(bib, day, avg.sg))

avg.sgtime
#> # A tibble: 3 x 3
#>     bib   day avg.sg
#>   <dbl> <dbl>  <dbl>
#> 1    42     1   20.1
#> 2    43     1   19.1
#> 3    44     2   19.3
df %>% left_join(avg.sgtime) %>%
  mutate(diff = finish - avg.sg)
#> Joining, by = c("bib", "day")
#> # A tibble: 36 x 9
#>      bib   sta   run course finish comment   day avg.sg   diff
#>    <dbl> <dbl> <dbl> <chr>   <dbl>   <dbl> <dbl>  <dbl>  <dbl>
#>  1    42     9     1 SG       19.9      99     1   20.1 -0.193
#>  2    42    17     2 A        20.0      11     1   20.1 -0.113
#>  3    42    27     3 B        20.9      22     1   20.1  0.847
#>  4    42    39     4 A        19.6      11     1   20.1 -0.473
#>  5    42    48     5 SG       20.2      99     1   20.1  0.167
#>  6    42    61     6 C        22.9      33     1   20.1  2.83 
#>  7    42    76     7 B        20.7      22     1   20.1  0.627
#>  8    42    86     8 C        22.7      33     1   20.1  2.67 
#>  9    42    93     9 C        22.8      33     1   20.1  2.68 
#> 10    42   103    10 A        19.8      11     1   20.1 -0.283
#> # … with 26 more rows

Created on 2021-07-04 by the reprex package (v2.0.0)

data


df <- tribble(~bib, ~sta, ~run, ~course, ~finish, ~comment, ~day,
              42,     9,     1,       "SG",    19.88,        99,     1,
              42,    17,     2,        "A",    19.96,        11,     1,
              42,    27,     3,        "B",    20.92,        22,     1,
              42,    39,     4,        "A",    19.60,        11,     1,
              42,    48,     5,       "SG",    20.24,        99,     1,
              42,    61,     6,        "C",    22.90,        33,     1,
              42,    76,     7,        "B",    20.70,        22,     1,
              42,    86,     8,        "C",    22.74,        33,     1,
              42,    93,     9,        "C",    22.75,        33,     1,
              42,   103,    10,        "A",    19.79,        11,     1,
              42,   114,    11,        "B",    20.67,        22,     1,
              42,   120,    12,       "SG",    20.10,        99,     1,
              
              43,     9,     1,       "SG",    19.12,        99,     1,
              43,    17,     2,        "A",    19.64,        11,     1,
              43,    27,     3,        "B",    20.62,        22,     1,
              43,    39,     4,        "A",    19.23,        11,     1,
              43,    48,     5,       "SG",    20.11,        99,     1,
              43,    61,     6,        "C",    22.22,        33,     1,
              43,    76,     7,        "B",    20.33,        22,     1,
              43,    86,     8,        "C",    22.51,        33,     1,
              43,    93,     9,        "C",    22.78,        33,     1,
              43,   103,    10,        "A",    19.98,        11,     1,
              43,   114,    11,        "B",    20.11,        22,     1,
              43,   120,    12,       "SG",    18.21,        99,     1,
              
              44,     9,     1,       "SG",    19.18,        99,     2,
              44,    17,     2,        "A",    19.56,        11,     2,
              44,    27,     3,        "B",    20.62,        22,     2,
              44,    39,     4,        "A",    19.20,        11,     2,
              44,    48,     5,       "SG",    20.74,        99,     2,
              44,    61,     6,        "C",    22.50,        33,     2,
              44,    76,     7,        "B",    20.60,        22,     2,
              44,    86,     8,        "C",    22.74,        33,     2,
              44,    93,     9,        "C",    22.85,        33,     2,
              44,   103,    10,        "A",    19.59,        11,     2,
              44,   114,    11,        "B",    20.27,        22,     2,
              44,   120,    12,       "SG",    18.10,        99,     2,
              )

Upvotes: 2

Related Questions