Reputation: 21
My data frame looks something like this:
id fruit1 fruit2 fruit3
1 apple banana orange
2 banana
3 apple apple
4 banana apple
5 orange apple
Is there a way to find the percent of how many individuals said each character string, each with a denominator of 5?
So the results would be apple = .8, banana = .6 and orange = .4
The actual database I am working with is huge so it would be great if the solution that didn't require typing each character string.
Upvotes: 2
Views: 55
Reputation: 388962
Using base R, we can rep
eat id
column and count frequencies using table
, calculate number of values greater than 0 in each column.
colMeans(table(rep(df$id, ncol(df) - 1), unlist(df[-1])) > 0)
# apple banana orange
# 0.8 0.6 0.4
data
I am asusming the blanks are NA
here.
df <- structure(list(id = 1:5, fruit1 = c("apple", "banana", "apple",
"banana", "orange"), fruit2 = c("banana", NA, "apple", "apple", "apple"),
fruit3 = c("orange", NA, NA, NA, NA)), row.names = c(NA, -5L),
class = "data.frame")
Upvotes: 0
Reputation: 226172
Construct data
dd <- read.csv(header=TRUE,text="
id,fruit1,fruit2,fruit3
1,apple,banana,orange
2,banana,,
3,apple,apple,
4,banana,apple,
5,orange,apple")
Set up results structure
## collapse data frame and exclude empty string
all_vals <- setdiff(unique(unlist(dd[,-1])),"")
res <- numeric(length(all_vals))
names(res) <- all_vals
Count
for (n in all_vals) {
has_string <- apply(dd,1, function(x) n %in% unlist(x))
res[n] <- mean(has_string)
}
res
## apple banana orange
## 0.8 0.6 0.4
If you're worried about performance, it might be faster to convert the non-id
columns to a character matrix up front (you could then skip all the unlist()
steps). Don't know how performance compares to @akrun's tidyverse solution.
Upvotes: 1
Reputation: 887078
An option in tidyverse would be reshape into 'long' format, then do pivot back to 'wide' and get the mean
library(dplyr)
library(tidyr)
dd %>%
# // reshape to long format
pivot_longer(cols = -id) %>%
# // remove the blank rows
filter(value != '') %>%
# // get the distinct rows
distinct(id, value) %>%
# // reshape to wide format
pivot_wider(names_from = value, values_from = value,
values_fn = list(value = length), values_fill = list(value = 0)) %>%
# get the mean of columns
summarise(across(apple:orange, mean))
# A tibble: 1 x 3
# apple banana orange
# <dbl> <dbl> <dbl>
#1 0.8 0.6 0.4
dd <- structure(list(id = 1:5, fruit1 = c("apple", "banana", "apple",
"banana", "orange"), fruit2 = c("banana", "", "apple", "apple",
"apple"), fruit3 = c("orange", "", "", "", "")),
class = "data.frame", row.names = c(NA,
-5L))
Upvotes: 4