Prometheus
Prometheus

Reputation: 2017

group data and calculate n-1 for periods

This is my dataframe:

dput(test)
structure(list(Branch = c("11 Oktomvri", "11 Oktomvri", "11 Oktomvri", 
"11 Oktomvri", "11 Oktomvri", "11 Oktomvri", "11 Oktomvri", "Aerodrom", 
"Aerodrom", "Aerodrom", "Aerodrom", "Aerodrom", "Aerodrom", "Aerodrom", 
"Aerodrom 2", "Aerodrom 2", "Aerodrom 2", "Aerodrom 2", "Aerodrom 2", 
"Aerodrom 2", "Aerodrom 2", "Bitola", "Bitola", "Bitola", "Bitola", 
"Bitola", "Bitola", "Bitola"), period = c("January", "February", 
"March", "April", "May", "June", "July", "January", "February", 
"March", "April", "May", "June", "July", "January", "February", 
"March", "April", "May", "June", "July", "January", "February", 
"March", "April", "May", "June", "July"), value = c(1513, 1511, 
1520, 1524, 1508, 1504, 1517, 1364, 1381, 1400, 1403, 1401, 1406, 
1430, 674, 687, 689, 690, 696, 705, 715, 4400, 4393, 4365, 4342, 
4345, 4373, 4389)), .Names = c("Branch", "period", "value"), row.names = c(NA, 
-28L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("Branch", 
"period"), drop = TRUE, indices = list(3L, 1L, 0L, 6L, 5L, 2L, 
    4L, 10L, 8L, 7L, 13L, 12L, 9L, 11L, 17L, 15L, 14L, 20L, 19L, 
    16L, 18L, 24L, 22L, 21L, 27L, 26L, 23L, 25L), group_sizes = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
    Branch = c("11 Oktomvri", "11 Oktomvri", "11 Oktomvri", "11 Oktomvri", 
    "11 Oktomvri", "11 Oktomvri", "11 Oktomvri", "Aerodrom", 
    "Aerodrom", "Aerodrom", "Aerodrom", "Aerodrom", "Aerodrom", 
    "Aerodrom", "Aerodrom 2", "Aerodrom 2", "Aerodrom 2", "Aerodrom 2", 
    "Aerodrom 2", "Aerodrom 2", "Aerodrom 2", "Bitola", "Bitola", 
    "Bitola", "Bitola", "Bitola", "Bitola", "Bitola"), period = c("April", 
    "February", "January", "July", "June", "March", "May", "April", 
    "February", "January", "July", "June", "March", "May", "April", 
    "February", "January", "July", "June", "March", "May", "April", 
    "February", "January", "July", "June", "March", "May")), row.names = c(NA, 
-28L), class = "data.frame", vars = c("Branch", "period"), drop = TRUE, .Names = c("Branch", 
"period")))

I'm not sure how to group the data based on Branch and Period, and calculate value based on period_n - period_n-1.

The output should be:

city     period   value  diff_n_1
Bitola   March     4365    -28
Bitola   April     2000    13

My attepmt:

results <- sample2 %>% 
  group_by(Branch, period) %>%
  arrange(Branch) %>% 
  mutate(lagged_period = lag(value), client_diff = value - lagged_period)

I'm not sure how to pull-off the last line.

Any ideas?

Upvotes: 1

Views: 35

Answers (1)

Dave Gruenewald
Dave Gruenewald

Reputation: 5689

I think you had it nearly right, but you don't need the Period in group_by, since that really is the variable we are using lag on. Currently, your above work is resulting in exclusively NAs because there is only one element in each defined group (so no lag is available).

This should work:

library(dplyr)

sample2 %>% 
  group_by(Branch) %>%
  arrange(Branch) %>% 
  mutate(lagged_period = lag(value),
         client_diff = value - lagged_period)

Then, of course, if you wanted to remove the NAs from the calculated field, you can just pipe the above into:

filter(!is.na(client_diff))

Upvotes: 2

Related Questions