Brisbane Pom
Brisbane Pom

Reputation: 611

Is there a dplyr function to determine the most commonly encountered categorical value within a group?

I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:

For example to take the table Cus1

Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
             Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
              Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
              )

And generate the table Cus_Summary:

Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
              Product = c("COKE", "BURGER", "CHICKEN"),
              Store = c("NYC", "Chicago", "LA")
              )

Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?

I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).

Upvotes: 2

Views: 710

Answers (4)

moodymudskipper
moodymudskipper

Reputation: 47310

Using SO's favourite Mode function (though you could use any):

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

In base R

aggregate(. ~ Customer, lapply(Cus,as.character),  Mode)
#   Customer Product   Store
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA

using dplyr

library(dplyr)
Cus %>%
  group_by(Customer) %>%
  summarise_all(Mode)

# # A tibble: 3 x 3
# Customer Product   Store
# <fctr>  <fctr>  <fctr>
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388982

If you have many columns and want to find out maximum occurrence in all the columns you could use gather to convert the data in long format, count the occurrence for each column, group_by Customer and column and keep only the rows with maximum count and then spread it back to wide format.

library(tidyverse)

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)

# Customer Product Store  
#  <fct>    <chr>   <chr>  
#1 C-01     COKE    NYC    
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA   

EDIT

In case of ties if we want to randomly select ties we can filter all the max values and then use sample_n function to select random rows.

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  filter(n == max(n)) %>%
  sample_n(1) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)


# Customer Product Store   
#  <fct>    <chr>   <chr>   
#1 C-01     COKE    NYC     
#2 C-02     BURGER  Chicago 
#3 C-03     FISH    San Fran

Upvotes: 0

Maurits Evers
Maurits Evers

Reputation: 50678

How about this?

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = first(names(sort(table(Product), decreasing = TRUE))),
        Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
#  Customer Product Store
#  <fct>    <chr>   <chr>
#1 C-01     COKE    NYC
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA

Note that in the case of ties this selects the first entry in alphabetical order.


Update

To randomly select an entry from tied top frequency entries we could define a custom function

top_random <- function(x) {
    tbl <- sort(table(x), decreasing = T)
    top <- tbl[tbl == max(tbl)]
    return(sample(names(top), 1))
}

Then the following randomly selects one of the tied top entries:

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = top_random(Product),
        Store = top_random(Store))

Upvotes: 3

Paweł Chabros
Paweł Chabros

Reputation: 2399

In my solution, if there are more than one most frequent value, all are presented:

library(tidyverse)

Cus %>%
  gather('type', 'value', -Customer) %>%
  group_by(Customer, type, value) %>%
  count() %>%
  group_by(Customer) %>%
  filter(n == max(n)) %>%
  nest() %>%
  mutate(
    Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
    Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
  ) %>%
  select(-data)

Result is:

# A tibble: 3 x 3
  Customer Product       Store           
  <fct>    <chr>         <chr>           
1 C-01     COKE          NYC             
2 C-02     BURGER        Chicago, Detroit
3 C-03     CHICKEN, FISH LA, San Fran  

Upvotes: 0

Related Questions