KyBt
KyBt

Reputation: 21

Finding the percent of a character string across a data frame but only counting once per row

My data frame looks something like this:

id fruit1 fruit2 fruit3 
1  apple  banana orange
2  banana
3  apple  apple
4  banana apple
5  orange apple

Is there a way to find the percent of how many individuals said each character string, each with a denominator of 5?

So the results would be apple = .8, banana = .6 and orange = .4

The actual database I am working with is huge so it would be great if the solution that didn't require typing each character string.

Upvotes: 2

Views: 55

Answers (3)

Ronak Shah
Ronak Shah

Reputation: 388962

Using base R, we can repeat id column and count frequencies using table, calculate number of values greater than 0 in each column.

colMeans(table(rep(df$id, ncol(df) - 1), unlist(df[-1])) > 0)

# apple banana orange 
#   0.8    0.6    0.4 

data

I am asusming the blanks are NA here.

df <- structure(list(id = 1:5, fruit1 = c("apple", "banana", "apple", 
"banana", "orange"), fruit2 = c("banana", NA, "apple", "apple", "apple"), 
fruit3 = c("orange", NA, NA, NA, NA)), row.names = c(NA, -5L), 
class = "data.frame")

Upvotes: 0

Ben Bolker
Ben Bolker

Reputation: 226172

Construct data

dd <- read.csv(header=TRUE,text="
id,fruit1,fruit2,fruit3 
1,apple,banana,orange
2,banana,,
3,apple,apple,
4,banana,apple,
5,orange,apple")

Set up results structure

## collapse data frame and exclude empty string
all_vals <- setdiff(unique(unlist(dd[,-1])),"")
res <- numeric(length(all_vals))
names(res) <- all_vals

Count

for (n in all_vals) {
    has_string <- apply(dd,1, function(x) n %in% unlist(x))
    res[n] <- mean(has_string)
}
res

## apple banana orange 
##  0.8    0.6    0.4 

If you're worried about performance, it might be faster to convert the non-id columns to a character matrix up front (you could then skip all the unlist() steps). Don't know how performance compares to @akrun's tidyverse solution.

Upvotes: 1

akrun
akrun

Reputation: 887078

An option in tidyverse would be reshape into 'long' format, then do pivot back to 'wide' and get the mean

library(dplyr)
library(tidyr)
dd %>% 
  # // reshape to long format
  pivot_longer(cols = -id) %>%
  # // remove the blank rows
  filter(value != '') %>%
  # // get the distinct rows
  distinct(id, value) %>% 
  # // reshape to wide format
  pivot_wider(names_from = value, values_from = value, 
      values_fn = list(value = length), values_fill = list(value = 0)) %>% 
  # get the mean of columns
  summarise(across(apple:orange, mean))
# A tibble: 1 x 3
#  apple banana orange
#  <dbl>  <dbl>  <dbl>
#1   0.8    0.6    0.4

data

dd <- structure(list(id = 1:5, fruit1 = c("apple", "banana", "apple", 
"banana", "orange"), fruit2 = c("banana", "", "apple", "apple", 
"apple"), fruit3 = c("orange", "", "", "", "")),
class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 4

Related Questions