exhoosier10
exhoosier10

Reputation: 121

R: How to sum based on multiple criteria and summarize table

Here is my original data frame:

df <- read.table(text="
  Date         Index  Event
  2014-03-31   A      x
  2014-03-31   A      x
  2014-03-31   A      y
  2014-04-01   A      y
  2014-04-01   A      x
  2014-04-01   B      x
  2014-04-02   B      x
  2014-04-03   A      x
  2014-09-30   B      x", header = T, stringsAsFactors = F)

date_range <- seq(as.Date(min(df$Date)), as.Date(max(df$Date)), 'days')
indices <- unique(df$Index)
events_table <- unique(df$Event)

I want my desired output to summarise my dataframe and have a unique record for each index in indices and each date in date_range while providing a cumulative value of each event in events_table in a new column for all dates prior to the value in the Date column. Sometimes there are no records for each index or every date.

Here is my desired output:

Date        Index  cumsum(Event = x) cumsum(Event = y)
2014-03-31  A      0                 0
2014-03-31  B      0                 0
2014-04-01  A      2                 1
2014-04-01  B      0                 0
2014-04-02  A      3                 2
2014-04-02  B      1                 0
...  
2014-09-29  A      4                 2
2014-09-29  B      2                 0
2014-09-30  A      4                 2
2014-09-30  B      2                 0

FYI -- this is a simplified version of the data frame. There are ~200,000 records per year with hundreds of different Index fields for each Date.

I've done this in the past before my hard drive fried using by and maybe aggregate, but the process was very slow and I'm not able to get it worked out this time around. I've also tried ddply, but I'm not able to get the cumsum function to work with it. Using ddply, I tried something like:

ddply(xo1, .(Date,Index), summarise, 
      sum.x = sum(Event == 'x'), 
      sum.y = sum(Event == 'y'))

to no avail.
Through searching, I've found Replicating an Excel SUMIFS formula which gets me the cumulative part of my project, but with this I wasn't able to figure out how to summarize it down to only one record per date/index combo. I also came across sum/aggregate data based on dates, R but here I wasn't able to work out the dynamic date aspect.

Thanks for anyone that can help!

Upvotes: 7

Views: 2711

Answers (2)

davechilders
davechilders

Reputation: 9133

library(dplyr)
library(tidyr)

df$Date <- as.Date(df$Date)

Step 1: Generate a full list of {Date, Index} pairs

full_dat <- expand.grid(
  Date = date_range, 
  Index = indices,
  stringsAsFactors = FALSE
  ) %>% 
  arrange(Date, Index) %>%
  tbl_df

Step 2: Define a cumsum() function that ignores NA

cumsum2 <- function(x){

  x[is.na(x)] <- 0
  cumsum(x)

}

Step 3: Generate totals per {Date, Index}, join with full {Date, Index} data, and compute the lagged cumulative sum.

df %>%
  group_by(Date, Index) %>%
  summarise(
    totx = sum(Event == "x"),
    toty = sum(Event == "y")
    ) %>%
  right_join(full_dat, by = c("Date", "Index")) %>% 
  group_by(Index) %>%
  mutate(
    cumx = lag(cumsum2(totx)),
    cumy = lag(cumsum2(toty))
    ) %>%
  # some clean up.
  select(-starts_with("tot")) %>%
  mutate(
    cumx = ifelse(is.na(cumx), 0, cumx),
    cumy = ifelse(is.na(cumy), 0, cumy)
    )

Upvotes: 3

maloneypatr
maloneypatr

Reputation: 3622

Would something like this using dplyr and tidyr work?

library(dplyr)
library(tidyr)

df %>%
  group_by(Date, Index, Event) %>%
  summarise(events = n()) %>%
  group_by(Index, Event) %>%
  mutate(cumsum_events = cumsum(events)) %>%
  select(-events) %>%
  spread(Event, cumsum_events) %>%
  rename(sum.x = x,
         sum.y = y)

#        Date Index sum.x sum.y
#1 2014-03-31     A     2     1
#2 2014-04-01     A     3     2
#3 2014-04-01     B     1    NA
#4 2014-04-02     B     2    NA
#5 2014-04-03     A     4    NA
#6 2014-09-30     B     3    NA

Upvotes: 1

Related Questions