Reputation: 611
I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:
For example to take the table Cus1
Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
)
And generate the table Cus_Summary:
Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
Product = c("COKE", "BURGER", "CHICKEN"),
Store = c("NYC", "Chicago", "LA")
)
Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?
I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).
Upvotes: 2
Views: 710
Reputation: 47310
Using SO's favourite Mode function (though you could use any):
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
In base R
aggregate(. ~ Customer, lapply(Cus,as.character), Mode)
# Customer Product Store
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
using dplyr
library(dplyr)
Cus %>%
group_by(Customer) %>%
summarise_all(Mode)
# # A tibble: 3 x 3
# Customer Product Store
# <fctr> <fctr> <fctr>
# 1 C-01 COKE NYC
# 2 C-02 BURGER Chicago
# 3 C-03 CHICKEN LA
Upvotes: 0
Reputation: 388982
If you have many columns and want to find out maximum occurrence in all the columns you could use gather
to convert the data in long format, count
the occurrence for each column, group_by
Customer
and column and keep only the rows with maximum count and then spread
it back to wide format.
library(tidyverse)
Cus %>%
gather(key, value, -Customer) %>%
count(Customer, key, value) %>%
group_by(Customer, key) %>%
slice(which.max(n)) %>%
ungroup() %>%
spread(key, value) %>%
select(-n)
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
EDIT
In case of ties if we want to randomly select ties we can filter
all the max
values and then use sample_n
function to select random rows.
Cus %>%
gather(key, value, -Customer) %>%
count(Customer, key, value) %>%
group_by(Customer, key) %>%
filter(n == max(n)) %>%
sample_n(1) %>%
ungroup() %>%
spread(key, value) %>%
select(-n)
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 FISH San Fran
Upvotes: 0
Reputation: 50678
How about this?
Cus %>%
group_by(Customer) %>%
summarise(
Product = first(names(sort(table(Product), decreasing = TRUE))),
Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
Note that in the case of ties this selects the first entry in alphabetical order.
To randomly select an entry from tied top frequency entries we could define a custom function
top_random <- function(x) {
tbl <- sort(table(x), decreasing = T)
top <- tbl[tbl == max(tbl)]
return(sample(names(top), 1))
}
Then the following randomly selects one of the tied top entries:
Cus %>%
group_by(Customer) %>%
summarise(
Product = top_random(Product),
Store = top_random(Store))
Upvotes: 3
Reputation: 2399
In my solution, if there are more than one most frequent value, all are presented:
library(tidyverse)
Cus %>%
gather('type', 'value', -Customer) %>%
group_by(Customer, type, value) %>%
count() %>%
group_by(Customer) %>%
filter(n == max(n)) %>%
nest() %>%
mutate(
Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
) %>%
select(-data)
Result is:
# A tibble: 3 x 3
Customer Product Store
<fct> <chr> <chr>
1 C-01 COKE NYC
2 C-02 BURGER Chicago, Detroit
3 C-03 CHICKEN, FISH LA, San Fran
Upvotes: 0