Reputation: 586
I am trying to create a set of cumulative factor variables in R. My data frame
has treatment dummies for four points in time:
id t1 t2 t3 t4
1 0 0 0 1
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
5 1 0 0 0
What I want is a set of cumulative treatment variables (named tc in the following example) by time like this:
id tc1 tc2 tc3 tc4
1 0 0 0 1
2 1 1 1 1
3 0 0 0 1
4 0 1 1 1
5 1 1 1 1
I have tried the cumsum function, but I do not know how to use it with factor variables. Any idea how to do this?
Upvotes: 1
Views: 631
Reputation: 92302
One way is to use the matrixStats::rowCummaxs function, but you will need to convert to a matrix first. That said, judging by your data structure, I would recommend working with a matrix instead of a data.frame in the first place:
data1[-1] <- matrixStats::rowCummaxs(as.matrix(data1[-1]))
data1
# id t1 t2 t3 t4
# 1 1 0 0 0 1
# 2 2 1 1 1 1
# 3 3 0 0 0 1
# 4 4 0 1 1 1
# 5 5 1 1 1 1
Or the blatant apply by-row approach (which also converts to a matrix):
data1[-1] <- t(apply(data1[-1], 1, cummax))
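The snippets in this answer assume a data.frame called data1 whose dummy columns are numeric. The question mentions factor variables, and cummax is not meaningful for factors, so the columns would need converting first. A minimal sketch, assuming the dummies are stored as factors with labels "0" and "1" (the data below is a hypothetical reconstruction of the question's table):

```r
# Hypothetical reconstruction of the question's data, with factor dummies
data1 <- data.frame(
  id = 1:5,
  t1 = factor(c(0, 1, 0, 0, 1)),
  t2 = factor(c(0, 0, 0, 1, 0)),
  t3 = factor(c(0, 0, 0, 0, 0)),
  t4 = factor(c(1, 0, 1, 0, 0))
)

# Convert each factor to the numbers in its labels; going through
# as.character() avoids picking up the internal level codes instead
data1[-1] <- lapply(data1[-1], function(x) as.integer(as.character(x)))

# Row-wise cumulative maximum; apply() returns one column per input row,
# so the result is transposed back with t()
data1[-1] <- t(apply(data1[-1], 1, cummax))
data1
```

Calling as.integer() directly on a factor would return the level codes (1 and 2 here) rather than 0 and 1, which is why the conversion goes through as.character().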
Or, as @joran implied, we could try the long/wide transformation:
library(data.table)
dcast(
  melt(setDT(data1), id = "id")[, value := cummax(value), by = id],
  id ~ variable
)
# id t1 t2 t3 t4
# 1: 1 0 0 0 1
# 2: 2 1 1 1 1
# 3: 3 0 0 0 1
# 4: 4 0 1 1 1
# 5: 5 1 1 1 1
Or with dplyr and tidyr:
library(dplyr)
library(tidyr)
data1 %>%
  gather(variable, value, -id) %>%
  group_by(id) %>%
  mutate(value = cummax(value)) %>%
  spread(variable, value)
# Source: local data frame [5 x 5]
# Groups: id [5]
#
# id t1 t2 t3 t4
# (int) (int) (int) (int) (int)
# 1 1 0 0 0 1
# 2 2 1 1 1 1
# 3 3 0 0 0 1
# 4 4 0 1 1 1
# 5 5 1 1 1 1
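In more recent tidyr releases, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same long/wide idea with those functions, assuming tidyr 1.0.0 or later and numeric dummy columns (the data and the name wide are only for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the question's data
data1 <- data.frame(
  id = 1:5,
  t1 = c(0, 1, 0, 0, 1),
  t2 = c(0, 0, 0, 1, 0),
  t3 = c(0, 0, 0, 0, 0),
  t4 = c(1, 0, 1, 0, 0)
)

wide <- data1 %>%
  # Long format: one row per id/time pair, in the original column order
  pivot_longer(-id, names_to = "variable", values_to = "value") %>%
  group_by(id) %>%
  # Running maximum within each id
  mutate(value = cummax(value)) %>%
  ungroup() %>%
  # Back to one column per time point
  pivot_wider(names_from = variable, values_from = value)
wide
```

The ungroup() call is optional; it just returns an ungrouped tibble instead of one that is still grouped by id.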
Or an interesting alternative by @alexis_laz, accumulating pmax per row using Reduce:
data1[-1] <- Reduce(pmax, data1[-1], accumulate = TRUE)
data1
# id t1 t2 t3 t4
# 1 1 0 0 0 1
# 2 2 1 1 1 1
# 3 3 0 0 0 1
# 4 4 0 1 1 1
# 5 5 1 1 1 1
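All of the above overwrite t1 to t4 in place. If you would rather keep the originals and get separate tc1 to tc4 columns, as in the question, the Reduce result can be named and bound to the id column instead. A sketch, assuming numeric dummy columns named t1, t2, and so on (the names tc and result are only for illustration):

```r
# Hypothetical reconstruction of the question's data
data1 <- data.frame(
  id = 1:5,
  t1 = c(0, 1, 0, 0, 1),
  t2 = c(0, 0, 0, 1, 0),
  t3 = c(0, 0, 0, 0, 0),
  t4 = c(1, 0, 1, 0, 0)
)

# Running element-wise maximum across the columns; returns a list with
# one vector per time point
tc <- Reduce(pmax, data1[-1], accumulate = TRUE)

# Derive the new names from the old ones: t1 -> tc1, t2 -> tc2, ...
names(tc) <- sub("^t", "tc", names(data1)[-1])

# Keep id and append the cumulative columns, leaving data1 untouched
result <- cbind(data1["id"], as.data.frame(tc))
result
```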
Upvotes: 4
Reputation: 93938
max.col to the rescue:
df[-1][col(df[-1]) >= max.col(df[-1], ties.method="first")] <- 1
df
# id t1 t2 t3 t4
#1 1 0 0 0 1
#2 2 1 1 1 1
#3 3 0 0 0 1
#4 4 0 1 1 1
#5 5 1 1 1 1
And some more detailed explanation of how this works:
col(df[-1])
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 2 3 4
#[3,] 1 2 3 4
#[4,] 1 2 3 4
#[5,] 1 2 3 4
max.col(df[-1], ties.method="first")
#[1] 4 1 4 2 1
col(df[-1]) >= max.col(df[-1], ties.method="first")
# [,1] [,2] [,3] [,4]
#[1,] FALSE FALSE FALSE TRUE
#[2,] TRUE TRUE TRUE TRUE
#[3,] FALSE FALSE FALSE TRUE
#[4,] FALSE TRUE TRUE TRUE
#[5,] TRUE TRUE TRUE TRUE
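One caveat that may matter for other data: if an id is never treated (a row of all zeros), every column ties for the maximum, so max.col with ties.method="first" returns 1 and the whole row would be set to 1. That does not happen in the question's data, but a guard is cheap. A sketch, assuming 0/1 dummies and using a small made-up data frame that includes a never-treated row (the names m, first_one, and treated are only for illustration):

```r
# Made-up example: row 1 is never treated
df <- data.frame(
  id = 1:3,
  t1 = c(0, 1, 0),
  t2 = c(0, 0, 0),
  t3 = c(0, 0, 1)
)

m <- as.matrix(df[-1])

# Column index of the first maximum in each row (1 for an all-zero row)
first_one <- max.col(m, ties.method = "first")

# TRUE only for rows that contain at least one treatment
treated <- rowSums(m) > 0

# Fill from the first treated column onwards, but only in treated rows;
# both vectors recycle down the rows of the column-major matrix
m[col(m) >= first_one & treated] <- 1
df[-1] <- m
df
```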
Upvotes: 3