How to use data.table to build a new dataframe showing inflows into a specified transition state based on the value of an element in a prior row?

I am new to the data.table package, and am semi-new to R, and would like to use data.table because of its speed in working with very large data sets.

Suppose we start with this dataframe called "data", generated by the code beneath it:

> data
   ID Period_1 Period_2 Values State
1:  1        1  2020-01      5    X0
2:  1        2  2020-02     10    X1
3:  1        3  2020-03     15    X2
4:  2        1  2020-04      0    X0
5:  2        2  2020-05      2    X2
6:  2        3  2020-06      4    X0
7:  3        1  2020-02      3    X2
8:  3        2  2020-03      6    X1
9:  3        3  2020-04      9    X0

data <- 
  data.frame(
    ID = c(1,1,1,2,2,2,3,3,3),
    Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    Period_2 = c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06","2020-02","2020-03","2020-04"),
    Values = c(5, 10, 15, 0, 2, 4, 3, 6, 9),
    State = c("X0","X1","X2","X0","X2","X0", "X2","X1","X0")
  )

I would like to create a new dataframe showing all the "states" flowing into a user-specified target state (call it "X") over time, as measured by Period_1 in this data dataframe. For Period_1 = 1, we simply count all instances of target state X. For all Period_1 > 1, for any row with state = X, all instances of X are counted and placed into the corresponding row of the dataframe reflecting the state in the immediately prior Period_1 (for the same ID). How could this be efficiently done using data.table?

The image below illustrates this better, where the newly derived dataframe shows all states flowing into target state x0 over time:

enter image description here

I include in data other columns (Period_2 and Values) for use as this function later evolves, for alternatively defining the time horizon as Period_2 and for summing the flow of Values rather than counting in instances of state. I should be able to do those on my own after someone gives me a kick start with the request above.

Upvotes: 1

Views: 75

Answers (2)

Uwe
Uwe

Reputation: 42572

EDIT 10MAY2022: Streamlined code and adjusted explanation.

For the sake of completeness, here is a function definition which uses some of the more fancy parameters of the dcast() function. Eventually, this function is expanded to achieve OP's more advanced requirements.

Simple function

library(data.table)
state_inflow <- function(mydat, target_state) {
  dcast(
    setDT(mydat)[, Previous_State := shift(State, fill = target_state), by = ID],
    factor(Previous_State) ~ factor(Period_1), length, value.var = "Values",
    subset = .(State == target_state), drop = FALSE
  )
} 

Calling the function

state_inflow(data, "X0")

returns the expected result:

   Previous_State 1 2 3
1:             X0 2 0 0
2:             X1 0 0 1
3:             X2 0 0 1

Explanation

  1. The function takes a data.table and the target state as parameters.
  2. For each ID, State is shifted (lagged) to get the previous state. For the first period of each ID group, the target state is filled in. So, no special treatment of the first period is required.
  3. The modified data.table is then passed to dcast() where it is subsetted for the target state before transformation.
  4. Previous state and period are turned into factors within the cast formula to allow for the completion of missing values.
  5. drop = FALSE casts by including all missing combinations. This is necessary because subsetting will remove some combinations.
  6. The Values column is used as value variable.

Advanced function

The OP plans to evolve the function

for alternatively defining the time horizon as Period_2 and for summing the flow of Values rather than counting in instances of state

This can be achieved by adding more parameters to the function:

state_inflow <- function(mydat, target_state, period_col_name, fct) {
  dcast(
    setDT(mydat)[, Previous_State := shift(State, fill = target_state), by = ID],
    as.formula(sprintf("factor(Previous_State) ~ factor(%s)", period_col_name)), fct, 
    value.var = "Values", subset = .(State == target_state), drop = FALSE
  )
}

Some examples:

state_inflow(data, "X0", "Period_1", sum)
   Previous_State 1 2 3
1:             X0 5 0 0
2:             X1 0 0 9
3:             X2 0 0 4
state_inflow(data, "X2", "Period_2", length)
   Previous_State 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06
1:             X0       0       0       0       0       1       0
2:             X1       0       0       1       0       0       0
3:             X2       0       1       0       0       0       0

Upvotes: 1

langtang
langtang

Reputation: 24832

Here is one option:

f <- function(s) {
  dcast(
    rbind(unique(data[,.(State,Period_1,N=0)]),
        data[, priorState:=lag(State), by = ID] %>% 
          .[State==s] %>% 
          .[!is.na(priorState), State:=priorState] %>% 
          .[, .N, .(State,Period_1)]
    ),
    State~Period_1, value.var="N", fun.aggregate=sum
  )
}

setDT(data)

f("X0")

Output:

    State     1     2     3
   <char> <num> <num> <num>
1:     X0     2     0     0
2:     X1     0     0     1
3:     X2     0     0     1

Upvotes: 2

Related Questions