Reputation: 251
My data consists of 3 columns: - segment - category - product number
How can I create a cross tab ("segment" as columns, "category" as rows) that counts unique values of "product number" (example below) ?
SEG1 SEG2 SEG3
CAT1 X
CAT2
CAT3
X - number of unique values from CAT1 and SEG1
Exmple data
CAT<-c("CAT1","CAT3","CAT3","CAT1","CAT2","CAT3","CAT3","CAT3","CAT3","CAT2")
SEG<-c("SEG1","SEG3","SEG3","SEG2","SEG2","SEG2","SEG3","SEG3","SEG2","SEG2")
PRODUCT<-c("a","a","a","a","d","e","b","c","a","a")
data<-cbind(CAT,SEG,PRODUCT)
Many thanks in advance ! Best regards, Bartek
Upvotes: 2
Views: 1709
Reputation: 3883
A couple more suggestions if your data is in a data frame. Can use dplyr::n_distinct
instead of n_unique
as defined below.
data = data.frame(
CAT = factor(CAT),
SEG = factor(SEG),
PRODUCT = as.character(PRODUCT)
)
n_unique <- function(x) length(unique(x))
Using base R
dat_counts <- aggregate(PRODUCT ~ CAT + SEG, data = data, FUN = n_unique)
xtabs(PRODUCT ~ CAT + SEG, data = dat_counts)
Using the tables package. Note, the grouping variables (CAT and SEG) need to be factors, and the value variable (PRODUCT) needs to be a character variable.
tables::tabular(
CAT ~ SEG * PRODUCT * n_unique
, data = data)
These are fine for small datasets, but the other approaches are faster for larger datasets.
Upvotes: 0
Reputation: 303
High speed solution with dplyr and tidyr packages.
library(dplyr)
library(tidyr)
CAT <- c("CAT1","CAT3","CAT3","CAT1","CAT2","CAT3","CAT3","CAT3","CAT3","CAT2")
SEG <- c("SEG1","SEG3","SEG3","SEG2","SEG2","SEG2","SEG3","SEG3","SEG2","SEG2")
PRODUCT <- c("a","a","a","a","d","e","b","c","a","a")
data <- data.frame(CAT, SEG, PRODUCT)
# Elegant solution with pipes (%>%)
data %>%
group_by(CAT, SEG) %>%
summarize(uni.prod = n_distinct(PRODUCT)) %>%
spread(CAT, uni.prod)
# Solution without use pipes
groups <- group_by(data, CAT, SEG)
s <- summarize(groups, uni.prod = n_distinct(PRODUCT))
spread(s, CAT, uni.prod)
Upvotes: 1
Reputation: 194
In case you are using a data.table, you could really speed the operation up for larger dataframes. You could use
library(data.table)
library(reshape)
DF<-data.table(DF)
DF_agg<-DF[,j=list(count_prod=length(unique(DF$Product_Number)),by=c("Segment","Category")]
DF_agg<-cast(DF_agg,Segment~Category,sum)
Upvotes: 1
Reputation: 24074
You can simply compute a crosstable of your data but without duplicated lines to be sure to only count unique product numbers:
nodup <- which(!duplicated(data))
table(data[nodup, "CAT"],data[nodup, "SEG"])
SEG1 SEG2 SEG3
CAT1 1 1 0
CAT2 0 2 0
CAT3 0 2 3
Upvotes: 3
Reputation: 194
library(plyr)
library(reshape)
data <- data.frame(data)
a <- ddply(data,.(CAT,SEG),summarize,unq=length(unique(PRODUCT)))
b <- cast(a,CAT~SEG,mean)
This will generate NaN at places where the count of unique values = 0
Upvotes: 0
Reputation: 4335
> set.seed(1)
> mydf <- data.frame(
+ Values = rep(c("111", "222", "333"), times = c(5, 3, 2)),
+ Year = c(rep(c("1999", "2000"), times = c(3, 2)),
+ "1999", "1999", "2000", "2000", "2000"),
+ Month = sample(c("Jan", "Feb", "Mar"), 10, replace = TRUE)
+ )
> mydf
Values Year Month
1 111 1999 Jan
2 111 1999 Feb
3 111 1999 Feb
4 111 2000 Mar
5 111 2000 Jan
6 222 1999 Mar
7 222 1999 Mar
8 222 2000 Feb
9 333 2000 Feb
10 333 2000 Jan
> with(mydf, tapply(Month, list(Values, Year), FUN = function(x) length(unique(x))))
1999 2000
111 2 2
222 1 1
333 NA 2
>
For your example:
> data
CAT SEG PRODUCT
1 CAT1 SEG1 a
2 CAT3 SEG3 a
3 CAT3 SEG3 a
4 CAT1 SEG2 a
5 CAT2 SEG2 d
6 CAT3 SEG2 e
7 CAT3 SEG3 b
8 CAT3 SEG3 c
9 CAT3 SEG2 a
10 CAT2 SEG2 a
> with(data, tapply(PRODUCT, list(CAT, SEG), FUN = function(x) length(unique(x))))
SEG1 SEG2 SEG3
CAT1 1 1 NA
CAT2 NA 2 NA
CAT3 NA 2 3
Upvotes: 0