Reputation: 2730
I am new to the data.table package, and am semi-new to R, and would like to use data.table because of its speed in working with very large data sets.
Suppose we start with this dataframe called "data
", generated by the code beneath it:
> data
ID Period_1 Period_2 Values State
1: 1 1 2020-01 5 X0
2: 1 2 2020-02 10 X1
3: 1 3 2020-03 15 X2
4: 2 1 2020-04 0 X0
5: 2 2 2020-05 2 X2
6: 2 3 2020-06 4 X0
7: 3 1 2020-02 3 X2
8: 3 2 2020-03 6 X1
9: 3 3 2020-04 9 X0
data <-
data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Period_2 = c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06","2020-02","2020-03","2020-04"),
Values = c(5, 10, 15, 0, 2, 4, 3, 6, 9),
State = c("X0","X1","X2","X0","X2","X0", "X2","X1","X0")
)
I would like to create a new dataframe showing all the "states" flowing into a user-specified target state (call it "X") over time, as measured by Period_1 in this data
dataframe. For Period_1 = 1, we simply count all instances of target state X. For all Period_1 > 1, for any row with state = X, all instances of X are counted and placed into the corresponding row of the dataframe reflecting the state in the immediately prior Period_1 (for the same ID). How could this be efficiently done using data.table?
The image below illustrates this better, where the newly derived dataframe shows all states flowing into target state x0 over time:
I include in data
other columns (Period_2 and Values) for use as this function later evolves, for alternatively defining the time horizon as Period_2 and for summing the flow of Values rather than counting in instances of state. I should be able to do those on my own after someone gives me a kick start with the request above.
Upvotes: 1
Views: 75
Reputation: 42572
EDIT 10MAY2022: Streamlined code and adjusted explanation.
For the sake of completeness, here is a function definition which uses some of the more fancy parameters of the dcast()
function. Eventually, this function is expanded to achieve OP's more advanced requirements.
library(data.table)
state_inflow <- function(mydat, target_state) {
dcast(
setDT(mydat)[, Previous_State := shift(State, fill = target_state), by = ID],
factor(Previous_State) ~ factor(Period_1), length, value.var = "Values",
subset = .(State == target_state), drop = FALSE
)
}
Calling the function
state_inflow(data, "X0")
returns the expected result:
Previous_State 1 2 3 1: X0 2 0 0 2: X1 0 0 1 3: X2 0 0 1
ID
, State
is shifted (lagged) to get the previous state. For the first period of each ID
group, the target state is filled in. So, no special treatment of the first period is required.dcast()
where it is subsetted for the target state before transformation.drop = FALSE
casts by including all missing combinations. This is necessary because subsetting will remove some combinations.Values
column is used as value variable.The OP plans to evolve the function
for alternatively defining the time horizon as Period_2 and for summing the flow of Values rather than counting in instances of state
This can be achieved by adding more parameters to the function:
state_inflow <- function(mydat, target_state, period_col_name, fct) {
dcast(
setDT(mydat)[, Previous_State := shift(State, fill = target_state), by = ID],
as.formula(sprintf("factor(Previous_State) ~ factor(%s)", period_col_name)), fct,
value.var = "Values", subset = .(State == target_state), drop = FALSE
)
}
Some examples:
state_inflow(data, "X0", "Period_1", sum)
Previous_State 1 2 3 1: X0 5 0 0 2: X1 0 0 9 3: X2 0 0 4
state_inflow(data, "X2", "Period_2", length)
Previous_State 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 1: X0 0 0 0 0 1 0 2: X1 0 0 1 0 0 0 3: X2 0 1 0 0 0 0
Upvotes: 1
Reputation: 24832
Here is one option:
f <- function(s) {
dcast(
rbind(unique(data[,.(State,Period_1,N=0)]),
data[, priorState:=lag(State), by = ID] %>%
.[State==s] %>%
.[!is.na(priorState), State:=priorState] %>%
.[, .N, .(State,Period_1)]
),
State~Period_1, value.var="N", fun.aggregate=sum
)
}
setDT(data)
f("X0")
Output:
State 1 2 3
<char> <num> <num> <num>
1: X0 2 0 0
2: X1 0 0 1
3: X2 0 0 1
Upvotes: 2