Reputation: 507
I have some survey data in which columns correspond to items and rows correspond to customers saying how likely they are to buy each item. Looks like this:
item1 = c("Likely", "Unlikely", "Very Likely","Likely")
item2 = c("Likely", "Unlikely", "Very Likely","Unlikely")
item3 = c("Very Likely", "Unlikely", "Very Likely","Likely")
df = data.frame(item1, item2, item3)
I want a summary table giving the percentage of each response for each item. Right now I'm using table() on each column for this process, and its a lot of code to manipulate. How can I do this using plyr or apply or something faster?
Current solution:
d1<-as.data.frame(table(df$item1))
d1$item1_percent<- d1$Freq/sum(d1$Freq)
names(d1)<-c("Response","item1_freqs","item1_percent")
d2<-as.data.frame(table(df$item2))
d2$item2_percent<- d2$Freq/sum(d2$Freq)
names(d2)<-c("Response","item2_freqs","item2_percent")
d3<-as.data.frame(table(df$item3))
d3$item3_percent<- d3$Freq/sum(d3$Freq)
names(d3)<-c("Response","item3_freqs","item3_percent")
results<-cbind(d1,d2[,2:3],d3[,2:3])
Note I don't really need the freq counts, just the percentages.
Thanks in advance!
Upvotes: 5
Views: 3328
Reputation: 175
I would suggest using a different way of organizing the data, using factor levels to differ the items. This makes it easy to work with the data. I will transform your data using the gather function, and then use summarise to calculate the frequency percentages:
library(tidyverse)
results <- df %>%
gather("item", "likelihood") %>%
group_by(item, likelihood) %>%
summarise(n = n() ) %>%
mutate(freq = n / sum(n))
# > results
# A tibble: 9 x 4
# Groups: item [3]
# item likelihood n freq
# <chr> <chr> <int> <dbl>
# 1 item1 Likely 2 0.50
# 2 item1 Unlikely 1 0.25
# 3 item1 Very Likely 1 0.25
# 4 item2 Likely 1 0.25
# 5 item2 Unlikely 2 0.50
# 6 item2 Very Likely 1 0.25
# 7 item3 Likely 1 0.25
# 8 item3 Unlikely 1 0.25
# 9 item3 Very Likely 2 0.50
I used dplyr and broom for this, but I prefer using the tidyverse library, since it loads both packages at once.
Edit: If you want to use keep the frequencies as columns, you can use spread to do so:
col_results <- results %>%
select(-n) %>%
spread(item, freq)
# > col_results
# A tibble: 3 x 4
# likelihood item1 item2 item3
# * <chr> <dbl> <dbl> <dbl>
# 1 Likely 0.50 0.25 0.25
# 2 Unlikely 0.25 0.50 0.25
# 3 Very Likely 0.25 0.25 0.50
Upvotes: 2
Reputation: 25904
As you have the same range of values in each item# you can use
sapply(df, function(x) prop.table(table(x)))
# item1 item2 item3
# Likely 0.50 0.25 0.25
# Unlikely 0.25 0.50 0.25
# Very Likely 0.25 0.25 0.50
But if they were different you can set each item# to have a common set of levels
df[] <- lapply(df, factor, levels=unique(unlist(df)))
sapply(df, function(x) prop.table(table(x)))
Upvotes: 6
Reputation: 994
Using dplyr
:
results = data.frame(df %>%
group_by(item1) %>%
summarise(no_rows=length(item1)/nrow(df)))
results = cbind(results,
data.frame(df %>%
group_by(item2) %>%
summarise(no_rows=length(item2)/nrow(df))))
results = cbind(results,
data.frame(df %>%
group_by(item3) %>%
summarise(no_rows=length(item3)/nrow(df))))
# > results
# item1 no_rows item2 no_rows item3 no_rows
# 1 Likely 0.50 Likely 0.25 Likely 0.25
# 2 Unlikely 0.25 Unlikely 0.50 Unlikely 0.25
# 3 Very Likely 0.25 Very Likely 0.25 Very Likely 0.50
Upvotes: 3
Reputation: 107747
Consider the chain merge with Reduce
where you first loop through each column of dataframe by number with lapply
to build a list of dataframes that is then passed into merge
on Response:
dfList <- lapply(seq_along(df), function(i){
d <- as.data.frame(table(df[,i]))
d$item1_percent <- d$Freq/sum(d$Freq)
# PASS COLUMN NUMBER INTO DF COLUMN NAMES
names(d) <- c("Response", paste0("item",i,"_freqs"), paste0("item",i,"_percent"))
return(d)
})
results2 <- Reduce(function(x,y) merge(x, y, by="Response", all.equal=TRUE), dfList)
# EQUIVALENT TO ORIGINAL results
all.equal(results, results2)
# [1] TRUE
identical(results, results2)
# [1] TRUE
Upvotes: 3