Reputation: 507

More efficient way to get frequency counts across columns of data frame

I have some survey data in which columns correspond to items and rows correspond to customers saying how likely they are to buy each item. Looks like this:

item1 = c("Likely", "Unlikely", "Very Likely","Likely") 
item2 = c("Likely", "Unlikely", "Very Likely","Unlikely")
item3 = c("Very Likely", "Unlikely", "Very Likely","Likely") 
df = data.frame(item1, item2, item3)

I want a summary table giving the percentage of each response for each item. Right now I'm using table() on each column for this process, and its a lot of code to manipulate. How can I do this using plyr or apply or something faster?

Current solution:

d1<-as.data.frame(table(df$item1))
d1$item1_percent<- d1$Freq/sum(d1$Freq)
names(d1)<-c("Response","item1_freqs","item1_percent")

d2<-as.data.frame(table(df$item2))
d2$item2_percent<- d2$Freq/sum(d2$Freq)
names(d2)<-c("Response","item2_freqs","item2_percent")

d3<-as.data.frame(table(df$item3))
d3$item3_percent<- d3$Freq/sum(d3$Freq)
names(d3)<-c("Response","item3_freqs","item3_percent")

results<-cbind(d1,d2[,2:3],d3[,2:3])

Note I don't really need the freq counts, just the percentages.

Thanks in advance!

Upvotes: 5

Answers (4)

Sollano Rabelo Braga

Reputation: 175

I would suggest using a different way of organizing the data, using factor levels to differ the items. This makes it easy to work with the data. I will transform your data using the gather function, and then use summarise to calculate the frequency percentages:

library(tidyverse)

results <- df %>% 
 gather("item", "likelihood") %>% 
 group_by(item, likelihood) %>% 
 summarise(n = n() ) %>% 
 mutate(freq = n / sum(n))

# > results
#  A tibble: 9 x 4
#  Groups:   item [3]
#    item  likelihood     n  freq
#   <chr>       <chr> <int> <dbl>
# 1 item1      Likely     2  0.50
# 2 item1    Unlikely     1  0.25
# 3 item1 Very Likely     1  0.25
# 4 item2      Likely     1  0.25
# 5 item2    Unlikely     2  0.50
# 6 item2 Very Likely     1  0.25
# 7 item3      Likely     1  0.25
# 8 item3    Unlikely     1  0.25
# 9 item3 Very Likely     2  0.50

I used dplyr and broom for this, but I prefer using the tidyverse library, since it loads both packages at once.

Edit: If you want to use keep the frequencies as columns, you can use spread to do so:

col_results <- results %>% 
  select(-n) %>% 
  spread(item, freq)

# > col_results
# A tibble: 3 x 4
#   likelihood item1 item2 item3
# *       <chr> <dbl> <dbl> <dbl>
# 1      Likely  0.50  0.25  0.25
# 2    Unlikely  0.25  0.50  0.25
# 3 Very Likely  0.25  0.25  0.50

Upvotes: 2

user20650

Reputation: 25904

As you have the same range of values in each item# you can use

sapply(df, function(x) prop.table(table(x)))
#             item1 item2 item3
# Likely       0.50  0.25  0.25
# Unlikely     0.25  0.50  0.25
# Very Likely  0.25  0.25  0.50

But if they were different you can set each item# to have a common set of levels

df[] <- lapply(df, factor, levels=unique(unlist(df)))
sapply(df, function(x) prop.table(table(x)))

Upvotes: 6

Matt

Reputation: 994

Using dplyr:

results = data.frame(df %>% 
                     group_by(item1) %>% 
                             summarise(no_rows=length(item1)/nrow(df)))
results = cbind(results,
          data.frame(df %>%  
                     group_by(item2) %>%
                             summarise(no_rows=length(item2)/nrow(df))))

results = cbind(results,
          data.frame(df %>% 
                     group_by(item3) %>% 
                             summarise(no_rows=length(item3)/nrow(df))))


# > results
#        item1 no_rows       item2 no_rows       item3 no_rows
# 1      Likely    0.50      Likely    0.25      Likely    0.25
# 2    Unlikely    0.25    Unlikely    0.50    Unlikely    0.25
# 3 Very Likely    0.25 Very Likely    0.25 Very Likely    0.50

Upvotes: 3

Parfait

Reputation: 107747

Consider the chain merge with Reduce where you first loop through each column of dataframe by number with lapply to build a list of dataframes that is then passed into merge on Response:

dfList <- lapply(seq_along(df), function(i){      
  d <- as.data.frame(table(df[,i]))
  d$item1_percent <- d$Freq/sum(d$Freq)
  # PASS COLUMN NUMBER INTO DF COLUMN NAMES
  names(d) <- c("Response", paste0("item",i,"_freqs"), paste0("item",i,"_percent"))

  return(d)      
})

results2 <- Reduce(function(x,y) merge(x, y, by="Response", all.equal=TRUE), dfList)

# EQUIVALENT TO ORIGINAL results
all.equal(results, results2)
# [1] TRUE
identical(results, results2)
# [1] TRUE

Upvotes: 3

More efficient way to get frequency counts across columns of data frame

Answers (4)

Related Questions