Reputation: 381
I am trying to sum each column in a data frame by group and set the value to 1 if the sum is not 0. I tried using the max function instead of the sum and ifelse combination, but I kept getting Inf values. The sum and ifelse combination works, but it is too slow for my data, which has 1.5m rows and 500 dummy variables to summarize.
Is there a better way to achieve this?
library(plyr)       # provides ddply(), used below; loaded first so it does not mask dplyr verbs
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:20, 1000, replace=T)
set.seed(10003)
group1 <- sample(0:1, 1000, replace=T)
set.seed(10004)
group2 <- sample(0:1, 1000, replace=T)
dummies <- data.frame(id, group1, group2)
# I am trying to sum each column in a data frame by group and
# set the value as 1 if the sum is not 0.
dummies %>%
  ddply('id', function(x){
    x %>%
      select_if(is.numeric) %>%
      summarise_each(list(sum)) %>%
      mutate_if(is.numeric, ~ifelse(.x > 0,1,.x))
  }, .progress = 'text') # It takes too much time
Upvotes: 3
Views: 326
Reputation: 887108
We could possibly reduce the time by switching to dplyr. Also, instead of computing the sum and then using ifelse to check and recode it, this can be done directly by checking whether any value in the group is greater than 0:
library(dplyr)
dummies %>%
  dplyr::select(id, where(is.numeric)) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))
Or, using data.table:
library(data.table)
setDT(dummies)[, lapply(.SD, function(x) +(any(x > 0, na.rm = TRUE))),
               by = id, .SDcols = patterns('group')]
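As a quick sanity check that the any() test gives the same flags as the sum and ifelse approach, here is a minimal sketch in base R. The toy data frame and the names toy, by_sum and by_any are made up for illustration, and it assumes the dummy columns only hold non-negative values (such as 0/1), which is what makes "sum is not 0" and "any value > 0" equivalent.

```r
# Toy data: 0/1 dummy columns grouped by id (values invented for illustration)
toy <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  group1 = c(0, 1, 0, 0, 1),
  group2 = c(0, 0, 0, 0, 1)
)

# Approach from the question: sum per group, then recode positive sums to 1
by_sum <- aggregate(. ~ id, data = toy,
                    FUN = function(x) ifelse(sum(x) > 0, 1, 0))

# Approach from the answer: flag the group if any value is positive;
# the unary + converts TRUE/FALSE to 1/0
by_any <- aggregate(. ~ id, data = toy,
                    FUN = function(x) +(any(x > 0, na.rm = TRUE)))

# For non-negative dummies both approaches produce the same flags
stopifnot(all(by_sum == by_any))
print(by_any)
```

aggregate() is only used here to keep the check free of package dependencies; on the full 1.5m-row data the dplyr or data.table versions above are the ones to use.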
Upvotes: 4