haver24
haver24

Reputation: 251

cross table that counts unique values in R

My data consists of 3 columns: - segment - category - product number

How can I create a cross tab ("segment" as columns, "category" as rows) that counts unique values of "product number" (example below) ?

      SEG1  SEG2  SEG3
CAT1   X
CAT2
CAT3

X - number of unique values from CAT1 and SEG1

Exmple data

CAT<-c("CAT1","CAT3","CAT3","CAT1","CAT2","CAT3","CAT3","CAT3","CAT3","CAT2") 
SEG<-c("SEG1","SEG3","SEG3","SEG2","SEG2","SEG2","SEG3","SEG3","SEG2","SEG2")   
PRODUCT<-c("a","a","a","a","d","e","b","c","a","a")
data<-cbind(CAT,SEG,PRODUCT)

Many thanks in advance ! Best regards, Bartek

Upvotes: 2

Views: 1709

Answers (6)

JWilliman
JWilliman

Reputation: 3883

A couple more suggestions if your data is in a data frame. Can use dplyr::n_distinct instead of n_unique as defined below.

data = data.frame(
  CAT = factor(CAT),
  SEG = factor(SEG),
  PRODUCT = as.character(PRODUCT)
)

n_unique <- function(x) length(unique(x))

Using base R

dat_counts <- aggregate(PRODUCT ~ CAT + SEG, data = data, FUN = n_unique)
xtabs(PRODUCT ~ CAT + SEG, data = dat_counts)

Using the tables package. Note, the grouping variables (CAT and SEG) need to be factors, and the value variable (PRODUCT) needs to be a character variable.

tables::tabular(
  CAT ~ SEG * PRODUCT * n_unique
  , data = data)

These are fine for small datasets, but the other approaches are faster for larger datasets.

Upvotes: 0

fvfaleiro
fvfaleiro

Reputation: 303

High speed solution with dplyr and tidyr packages.

library(dplyr)
library(tidyr)


CAT <- c("CAT1","CAT3","CAT3","CAT1","CAT2","CAT3","CAT3","CAT3","CAT3","CAT2") 
SEG <- c("SEG1","SEG3","SEG3","SEG2","SEG2","SEG2","SEG3","SEG3","SEG2","SEG2")   
PRODUCT <- c("a","a","a","a","d","e","b","c","a","a")
data <- data.frame(CAT, SEG, PRODUCT)

# Elegant solution with pipes (%>%)
data %>% 
  group_by(CAT, SEG) %>%  
  summarize(uni.prod = n_distinct(PRODUCT)) %>%  
  spread(CAT, uni.prod)

# Solution without use pipes
groups <- group_by(data, CAT, SEG)
s <- summarize(groups, uni.prod = n_distinct(PRODUCT))
spread(s, CAT, uni.prod)

Upvotes: 1

Aayush Agrawal
Aayush Agrawal

Reputation: 194

In case you are using a data.table, you could really speed the operation up for larger dataframes. You could use

    library(data.table)
    library(reshape)
    DF<-data.table(DF)
    DF_agg<-DF[,j=list(count_prod=length(unique(DF$Product_Number)),by=c("Segment","Category")]
    DF_agg<-cast(DF_agg,Segment~Category,sum)

Upvotes: 1

Cath
Cath

Reputation: 24074

You can simply compute a crosstable of your data but without duplicated lines to be sure to only count unique product numbers:

nodup <- which(!duplicated(data))
table(data[nodup, "CAT"],data[nodup, "SEG"])

       SEG1 SEG2 SEG3
  CAT1    1    1    0
  CAT2    0    2    0
  CAT3    0    2    3

Upvotes: 3

Aayush Agrawal
Aayush Agrawal

Reputation: 194

library(plyr)
library(reshape)
data <- data.frame(data)
a <- ddply(data,.(CAT,SEG),summarize,unq=length(unique(PRODUCT)))
b <- cast(a,CAT~SEG,mean)

This will generate NaN at places where the count of unique values = 0

Upvotes: 0

Prasanna Nandakumar
Prasanna Nandakumar

Reputation: 4335

> set.seed(1)
> mydf <- data.frame(
+     Values = rep(c("111", "222", "333"), times = c(5, 3, 2)),
+     Year = c(rep(c("1999", "2000"), times = c(3, 2)), 
+            "1999", "1999", "2000", "2000", "2000"),
+     Month = sample(c("Jan", "Feb", "Mar"), 10, replace = TRUE)
+ )
> mydf
   Values Year Month
1     111 1999   Jan
2     111 1999   Feb
3     111 1999   Feb
4     111 2000   Mar
5     111 2000   Jan
6     222 1999   Mar
7     222 1999   Mar
8     222 2000   Feb
9     333 2000   Feb
10    333 2000   Jan
> with(mydf, tapply(Month, list(Values, Year), FUN = function(x) length(unique(x))))
    1999 2000
111    2    2
222    1    1
333   NA    2
>

For your example:

> data
    CAT  SEG PRODUCT
1  CAT1 SEG1       a
2  CAT3 SEG3       a
3  CAT3 SEG3       a
4  CAT1 SEG2       a
5  CAT2 SEG2       d
6  CAT3 SEG2       e
7  CAT3 SEG3       b
8  CAT3 SEG3       c
9  CAT3 SEG2       a
10 CAT2 SEG2       a
> with(data, tapply(PRODUCT, list(CAT, SEG), FUN = function(x) length(unique(x))))
     SEG1 SEG2 SEG3
CAT1    1    1   NA
CAT2   NA    2   NA
CAT3   NA    2    3

Upvotes: 0

Related Questions