J.K.

Reputation: 381

Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R

Question

I am trying to sum each column in a data frame by group and set the value to 1 if the sum is not 0. I tried using the max function instead of the sum & ifelse combination, but I kept getting Inf values. The sum & ifelse combination works, but it takes too long to compute on my real data, which has 1.5 million rows and 500 dummy variables to summarize.

Is there a better way to achieve this?

Example dataset

  library(plyr)       # provides ddply(); loaded first so dplyr verbs are not masked
  library(tidyverse)
  library(tibble)
  library(data.table)

  # Make sure the dplyr versions are the ones used
  rename <- dplyr::rename
  select <- dplyr::select

  set.seed(10002)
  id <- sample(1:20, 1000, replace = TRUE)

  set.seed(10003)
  group1 <- sample(0:1, 1000, replace = TRUE)

  set.seed(10004)
  group2 <- sample(0:1, 1000, replace = TRUE)

  dummies <-
    data.frame(id, group1, group2)

Current Approach

# I am trying to sum each column in a data frame by group and 
# set the value as 1 if the sum is not 0.

  dummies %>% 
    ddply('id', function(x){
      x %>% 
        select_if(is.numeric) %>%
        summarise_each(list(sum)) %>% 
        mutate_if(is.numeric, ~ifelse(.x > 0,1,.x))
    }, .progress = 'text') # It takes too much time 

Upvotes: 3

Views: 326

Answers (1)

akrun

Reputation: 887108

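We could possibly reduce the time by switching from ddply to a grouped summarise in dplyr. Also, instead of computing the sum and then using ifelse to convert it back to 1, we can directly check whether any value in the group is greater than 0 and coerce the logical result to an integer with the unary +: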

library(dplyr)
dummies %>% 
    dplyr::select(id, where(is.numeric)) %>%
    dplyr::group_by(id) %>% 
    dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))
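
As a quick sanity check of that idea, here is a minimal base R sketch comparing the two formulations on a few small made-up vectors. The helper names `sum_then_flag` and `any_flag` are hypothetical and only for illustration, and the `na.rm = TRUE` in the sum is an assumption (the question's code does not set it). It also assumes the dummies are non-negative; with negative values the two could disagree.

```r
# Hypothetical helpers for illustration only (not from the question or answer)

# Question's idea: sum the column, then recode any positive sum to 1
sum_then_flag <- function(x) {
  s <- sum(x, na.rm = TRUE)  # na.rm = TRUE is an assumption
  ifelse(s > 0, 1, s)
}

# Answer's idea: is any value greater than 0? Unary + turns TRUE/FALSE into 1/0
any_flag <- function(x) +(any(x > 0, na.rm = TRUE))

x1 <- c(0, 0, 1, 0)  # one positive value
x2 <- c(0, 0, 0)     # all zeros
x3 <- c(0, NA, 1)    # a missing value and one positive value

checks <- c(
  sum_then_flag(x1) == any_flag(x1),
  sum_then_flag(x2) == any_flag(x2),
  sum_then_flag(x3) == any_flag(x3)
)
print(checks)
```

For 0/1 dummies the two agree, so the grouped `any()` can stand in for the sum-and-recode step.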

Or, using data.table (note that setDT converts dummies to a data.table by reference):

library(data.table)
setDT(dummies)[, lapply(.SD, function(x)
        +(any(x > 0, na.rm = TRUE))), id, .SDcols = patterns('group')]
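
If adding a dependency is not an option, a similar result can probably be had in base R with `rowsum()`, which computes per-group column sums in a single call. This is a different technique from the two shown above, not a variant of them, and I have not benchmarked it on data of the size in the question. The data below is regenerated with an arbitrary seed so the block runs on its own, and `dummy_cols`, `sums`, `flags` and `result` are just illustrative names.

```r
# Self-contained toy data, same shape as the question's example
set.seed(1)  # arbitrary seed, not the one used in the question
dummies <- data.frame(
  id     = sample(1:20, 1000, replace = TRUE),
  group1 = sample(0:1, 1000, replace = TRUE),
  group2 = sample(0:1, 1000, replace = TRUE)
)

# Every column except the grouping column is treated as a dummy
dummy_cols <- setdiff(names(dummies), "id")

# Per-group column sums; rows of the result are named by group id
sums <- rowsum(as.matrix(dummies[dummy_cols]), group = dummies$id, na.rm = TRUE)

# Recode: 1 if the group sum is positive, 0 otherwise
flags <- +(sums > 0)

# Put the group id back as a regular column
result <- data.frame(id = as.integer(rownames(flags)), flags, row.names = NULL)
head(result)
```

Unlike the `any()` versions, this still computes the full sums before recoding, so it assumes the dummies are non-negative.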

Upvotes: 4
