Reputation: 1077

R - Group By Multiple Columns

I'm trying to run analysis on a dataset that categorizes companies into 20 different industries, and some 800 categories. Each industry category is in it's own column. Here's a sample dataframe

df <- data.frame(biz.name=c("goog", "face", "eb"), worth=c(100, 200, 300),
cat1=c("social", "social", "social"), cat2=c(NA, "search", "finance"),
cat3=c(NA, NA, "commerce"))

I'd like to know how to run analysis on different types of categories. For instance, how would I get the the average worth of different categories, "social" or "finance". Each company can be in up to 20 categories (non-repeating per row).

The dplyr package is my normal go-to group_by method, but chaining doesn't seem to work for multiple columns

cat.test <- df %>% 
  group_by(cat1:cat2) %>%
  summarise (avg = mean(is.na(worth)))

The code produces a measure for each permutation of businesses with a combination of multiple categories, rather that each category individually. In the sample data frame, the category social should have a total net worth of 600 and mean of 300.

I've looked at multiple tutorials, but haven't found one that can group_by for multiple columns. Thanks and let me know if i can make this question any more clear.

[UPDATE: edited data.frame code]

Upvotes: 4

Answers (3)

Colonel Beauvel

Reputation: 31181

I would use data.table this way:

library(data.table)
melt(setDT(df[-1]), id.vars='worth', value.name='category')[,.(worth=sum(worth)),category]
#   category worth
#1:   social   600
#2:       NA   400
#3:   search   200
#4:  finance   300
#5: commerce   300

Upvotes: 3

thelatemail

Reputation: 93938

Using tidyr, something like:

library(tidyr)
df %>%
  gather(variable, category, -biz.name, -worth) %>%
  group_by(category) %>%
  summarise(worth=sum(worth))
#Source: local data frame [5 x 2]
# 
#  category worth
#1 commerce   300
#2  finance   300
#3   search   200
#4   social   600
#5       NA   400

This matches your requested sum figure for 'social'

Upvotes: 3

Tim Biegeleisen

Reputation: 522762

I cleaned up your code and was able to get a result out using the data.table package:

df <- data.frame(biz.name=c("goog", "face", "eb"), worth=c(100, 200, 300), 
                 cat1=c("social", "social", "social"), cat2=c("NA", "search", "finance"),
                 cat3=c("NA", "NA", "commerce"))

library(data.table)
dt <- data.table(df)
dt[, Mean:=mean(worth), by=list(cat1, cat2)]

> dt
     biz.name  worth   cat1    cat2     cat3 Mean
1:       goog    100 social      NA       NA  100
2:       face    200 social  search       NA  200
3:         eb    300 social finance commerce  300

Upvotes: 4

R - Group By Multiple Columns

Answers (3)

Related Questions