htaunay
htaunay

Reputation: 73

Updating a table with the rolling average of previous rows in R?

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of such event, as well as data regarding a user specifically. Multiple users can take part in the a same event.

For clarity, here is an simplified example of such table:

EventID       Date  Revenue   Time(s)  UserID     X     Y    Z
      1   1/1/2017      $10       120       1     3     2    2
      1   1/1/2017      $15       150       2     2     1    2
      2   2/1/2017      $50        60       1     1     5    1
      2   2/1/2017      $45       100       4     3     5    2
      3   3/1/2017      $25        75       1     2     3    1
      3   3/1/2017      $20       210       2     5     5    1
      3   3/1/2017      $25       120       3     1     0    4
      3   3/1/2017      $15       100       4     3     1    1
      4   4/1/2017      $75        25       4     0     2    1

My goal is to build a model that can, given a specific user's performance history (in the example attributes X, Y and Z), predict a given revenue and time for an event.

What I am after now is a way to format my data in order to train and test such model. More specifically, I want to transform the table in a way that each row would keep the event specific information, while presenting the moving average of each users attributes up until the previous event. An example of the thought process could be: a user up until an event presents averages of 2, 3.5, and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of such event were $25 and 75, now I will use this as a input for my training.

Once again for clarity, here is an example of the output I would expect applying such logic on the original table:

EventID       Date  Revenue   Time(s)  UserID     X     Y    Z
      1   1/1/2017      $10       120       1     0     0    0
      1   1/1/2017      $15       150       2     0     0    0
      2   2/1/2017      $50        60       1     3     2    2
      2   2/1/2017      $45       100       4     0     0    0
      3   3/1/2017      $25        75       1     2   3.5  1.5
      3   3/1/2017      $20       210       2     2     1    2
      3   3/1/2017      $25       120       3     0     0    0
      3   3/1/2017      $15       100       4     3     5    2
      4   4/1/2017      $75        25       4     3     3  1.5

Notice that in each users first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of his first appearance. In lines 5 and 9, users 1 and 4 third appearances start to show the rolling mean of their previous performances.

If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:

It can assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.

Also, It wasn't obvious how to summarize my question with a one liner title, so I'm open to suggestions from any R experts that might think of an improved way of describing it.

Upvotes: 0

Views: 1460

Answers (1)

Juan Bosco
Juan Bosco

Reputation: 1430

We can achieve your desired output using the dplyr package.

library(dplyr)

tablinka %>% 
      arrange(UserID, EventID) %>% 
      group_by(UserID) %>%
      mutate_at(c("X", "Y", "Z"), cummean) %>% 
      mutate_at(c("X", "Y", "Z"), lag) %>% 
      mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>% 
      arrange(EventID, UserID) %>% 
      ungroup()

We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).

Once this is done, we rearrange the data to its original state, and ungroup it.

Upvotes: 2

Related Questions