hammertyme

Reputation: 63

Groupby Multiple Columns using an input vector SparkR

I am using SparkR 2.1.0 for data manipulation.

I want to group by multiple columns programmatically. I know I can group by multiple columns if I list them out individually, or if I reference each one by its position in a vector. What I want instead is to pass the whole set of columns as a single vector, so that the function automatically adjusts to however many columns I give it.

Dummy data:

 cpny <- c("Fakeco1", "Fakeco2", "Fakeco3", "Fakeco4", "Fakeco5", "Fakeco6")
 state <- c("CA", "NY", "WA", "CA", "CA", "NY")
 public <- c("Y", "Y", "N", "N", "N", "N")
 color <- c("White", "Red", "Green", "Green", "Green", "Red")
 revs <- c(400, 200, 900, 500, 200, 120)
 df <- data.frame(cpny, state, public, color, revs)
 # Convert to SparkR dataframe
 df_s <- as.DataFrame(df)    

Works:

  df_grouped <- df_s %>%
  groupBy('state', 'public') %>%
  summarize(sum_Revs = sum(df_s$revs))

Also works:

  group_vars <- c('state', 'public')

  df_grouped <- df_s %>%
  groupBy(group_vars[[1]], group_vars[[2]]) %>%
  summarize(sum_Revs = sum(df_s$revs))

Doesn't work:

  group_vars <- c('state', 'public')

  df_grouped <- df_s %>%
  groupBy(group_vars) %>%
  summarize(sum_Revs = sum(df_s$revs))

Any solutions or alternative thoughts?

Upvotes: 4

Views: 615

Answers (1)

dewilliams

Reputation: 46

You can use do.call() (https://stat.ethz.ch/R-manual/R-devel/library/base/html/do.call.html): put the DataFrame and the column names into a single list, and do.call() passes each element of that list to groupBy as a separate argument. The following works for me:

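library(SparkR)    # assumes a Spark session is already running
library(magrittr)  # provides the %>% pipe used below

cpny <- c("Fakeco1", "Fakeco2", "Fakeco3", "Fakeco4", "Fakeco5", "Fakeco6")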
state <- c("CA", "NY", "WA", "CA", "CA", "NY")
public <- c("Y", "Y", "N", "N", "N", "N")
color <- c("White", "Red", "Green", "Green", "Green", "Red")
revs <- c(400, 200, 900, 500, 200, 120)
df <- data.frame(cpny, state, public, color, revs)
# Convert to SparkR dataframe
df_s <- as.DataFrame(df)  

group_vars <- c('state', 'public')


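# Build the argument list for groupBy(): the DataFrame first, then one
# entry per grouping column. seq_along() visits every index; range()
# returns only the minimum and maximum, so it would skip the middle
# columns whenever more than two are given.
function_params <- list(df_s)
for (i in seq_along(group_vars)) {
    function_params[[i + 1]] <- group_vars[i]
}

summarized <- do.call(SparkR::groupBy, function_params) %>%
    SparkR::summarize(sum_Revs = sum(df_s$revs))
SparkR::head(summarized)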
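
To illustrate why do.call() helps, here is a minimal base R sketch that runs without Spark. The function count_args is a hypothetical stand-in (not part of SparkR) with the same `x, ...` shape as groupBy, and the string "df" is only a placeholder for the DataFrame. It shows that a vector passed directly arrives as one argument, while do.call() spreads a list into one argument per element.

```r
# Hypothetical stand-in for a variadic function shaped like groupBy(x, ...).
# It only reports how many arguments arrived through `...`.
count_args <- function(x, ...) length(list(...))

group_vars <- c("state", "public", "color")

# Passing the vector directly: the function receives ONE argument,
# a character vector of length 3.
n_direct <- count_args("df", group_vars)

# Building a list and calling through do.call(): each column name
# becomes its own argument. c(list(...), as.list(...)) builds the
# argument list without an explicit loop.
params <- c(list("df"), as.list(group_vars))
n_spread <- do.call(count_args, params)

print(n_direct)  # 1
print(n_spread)  # 3
```

With the SparkR objects above, the same pattern would presumably read do.call(SparkR::groupBy, c(list(df_s), as.list(group_vars))), assuming groupBy takes the column names as separate character arguments as in the question's working example. I have not run that form against a Spark session.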

Upvotes: 3
