Metsfan
Metsfan

Reputation: 520

How can I calculate difference between between 2 non-date values in same column in same group?

This is a follow-up question to this post: In R how can I count the number of grouped pairs in which one row's column value is greater than another?

Here is my dput for the dataset df1:

structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski", 
"A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse", 
"Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter", 
"Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36, 
36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013, 
2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP", 
"MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM", 
"ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL", 
"ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102, 
134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163, 
362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA, 
13L), class = "data.frame")

Here is the code that for my previous problem matched the pairs up correctly:

df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G))

Each grouped pair has two observations. Each also has a column named G and a column Year.

Here is how the data looks after being grouped using the above code: https://www.dropbox.com/s/hh2qgkbn4cy4k4l/Data%20after%20grouping.png?dl=0

Now, what I would like to know for each matched pair is the difference in the "G column" value between the 'Age 37' value and the 'Age 36' value by doing this calculation: (Age 36 value) - (Age 37 value). A negative result is okay.

Also, for all the matched pairs in the dataset, I would like the sum of those differences.

Upvotes: 0

Views: 100

Answers (1)

Joris
Joris

Reputation: 417

If I understand you correctly:

df <- structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski", 
                        "A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse", 
                        "Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter", 
                        "Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36, 
                                                                   36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013, 
                                                                                                             2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP", 
                                                                                                                                                                           "MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM", 
                                                                                                                                                                           "ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL", 
                                                                                                                                                                                                        "ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102, 
                                                                                                                                                                                                                                                         134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163, 
                                                                                                                                                                                                                                                                                                              362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                  13L), class = "data.frame")
df1 <- df %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G)) %>% 
  mutate(g_diff = G[1] - G[2]) %>% 
  ungroup() %>% 
  mutate(sum_g_diff = sum(unique(g_diff)))

> df1
# A tibble: 4 x 9
  Name              Age  Year Tm    Lg        G    PA g_diff sum_g_diff
  <chr>           <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>      <dbl>
1 A.J. Pierzynski    36  2013 TEX   AL      134   529     32         35
2 A.J. Pierzynski    37  2014 TOT   ML      102   362     32         35
3 Albert Pujols      36  2016 LAA   AL      152   650      3         35
4 Albert Pujols      37  2017 LAA   AL      149   636      3         35

Or, if a cumulative sum (running total) of g_diff is the desired output (without summarising the data):

df1 %>%
  group_by(Name) %>%
  mutate(cols = c(g_diff[1], rep(0, n() -1))) %>%
  ungroup() %>%
  mutate(cum_sum = cumsum(cols)) %>%
  select(-cols)

# A tibble: 4 x 9
  Name              Age  Year Tm    Lg        G    PA g_diff cum_sum
  <chr>           <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>   <dbl>
1 A.J. Pierzynski    36  2013 TEX   AL      134   529     32      32
2 A.J. Pierzynski    37  2014 TOT   ML      102   362     32      32
3 Albert Pujols      36  2016 LAA   AL      152   650      3      35
4 Albert Pujols      37  2017 LAA   AL      149   636      3      35

(This solution is based on this question)

Upvotes: 1

Related Questions