Mine

Reputation: 861

Cumulative Count of Multiple Columns of a data.table in R

Given the example data.table below, I can compute the cumulative count for each categorical column (for every row, the number of earlier rows holding the same value), but on a much larger dataset my cumcount function is slow. I am looking for a much faster alternative to the cumcount function given below. The expected output is the totalCount variable below.

library(data.table)  # the only package this example actually uses


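# Cumulative count function: for each element, the number of earlier occurrences of the same value.
# It rescans x[1:i] on every iteration, so the run time grows quadratically with length(x).
cumcount <- function(x){
  counts <- numeric(length(x))
  names(counts) <- x
  for(i in seq_along(x)){  # seq_along() is safe for zero-length input, unlike 1:length(x)
    counts[i] <- sum(x[1:i] == x[i])
  }
  return(counts - 1)
}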


# Example dataframe

cat_var <- c("rock", "indie", "rock", "rock", "pop", "indie", "pop", "rock")
cat_var_2 <- c("blue", "green", "red", "red", "blue", "red", "green", "blue")
target_var <- c(0, 0, 1, 1, 1, 1, 0, 1)
df <- data.table(
  "categorical_variable"   = cat_var,
  "categorical_variable_2" = cat_var_2,
  "target_variable"        = target_var
)

# Cumulative count for categorical variables
nms <- c("categorical_variable", "categorical_variable_2")

totalCount <- sapply(df[,..nms], cumcount)


Upvotes: 0

Views: 132

Answers (1)

Peace Wang

Reputation: 2419

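Try data.table's built-in function rowid(). It numbers the occurrences of each value starting from 1, so subtracting 1 gives the zero-based cumulative count. The \(x) shorthand needs R 4.1 or later; on older versions write function(x) instead.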

df[,lapply(.SD, \(x) rowid(x) - 1), .SDcols = nms]
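
As a quick check, here is a minimal sketch that runs the rowid() approach on the example data from the question. The base R variant using ave() is not part of the original answer; it is included only as a dependency-free alternative, and the names fast, base_count and base_result are made up for illustration.

```r
library(data.table)

# Example data from the question (target_variable omitted, it is not needed here)
df <- data.table(
  categorical_variable   = c("rock", "indie", "rock", "rock", "pop", "indie", "pop", "rock"),
  categorical_variable_2 = c("blue", "green", "red", "red", "blue", "red", "green", "blue")
)
nms <- c("categorical_variable", "categorical_variable_2")

# rowid() numbers each value's occurrences from 1; subtract 1 for a zero-based count
fast <- df[, lapply(.SD, function(x) rowid(x) - 1L), .SDcols = nms]

# Hypothetical base R helper: ave() applies seq_along() within each group of equal values
base_count <- function(x) ave(seq_along(x), x, FUN = seq_along) - 1L
base_result <- sapply(df[, ..nms], base_count)

fast$categorical_variable    # 0 0 1 2 0 1 1 3
fast$categorical_variable_2  # 0 0 0 1 1 2 1 2
```

Both variants process each column without the nested rescan in cumcount, so they should scale much better on large tables. Note that the rowid() call returns a data.table, while sapply() returns a matrix like the original totalCount.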

Upvotes: 1
