RvS
RvS

Reputation: 169

Count the total and percentage of values in character class column

In a large dataset I have a column (treatment) which contains numbers, sometimes more then one, but is classed as character.

treatment <- c("1", "1", "2", "5", "1,2", "2,5", "1,2,5", "3") 
df <- data.frame(treatment)

Each number represents a treatment. I want to count the total number and percentage of each treatment.

Desired output:

treatment   number     percent
   1           4          50
   2           4          50
   3           1          12,5
   5           3          37,5

I've tried to sum and to use group_by(treatment) %>% summarise(percent = 100 *n() / nrow(df) But i'm having trouble with the rows with multiple numbers and it's class character. Any tips?

Upvotes: 1

Views: 488

Answers (1)

jpsmith
jpsmith

Reputation: 17215

You can use tidy::separate_longer_delim to separate out the combined values by "," - i.e.:

library(dplyr)
library(tidyr)

separate_longer_delim(df, treatment, ",")
#   treatment
#1          1
#2          1
#3          2
#4          5
#5          1
#6          2
#7          2
#8          5
#9          1
#10         2
#11         5
#12         3

Then simply use dplyr::count() to count and dplyr::mutate to get the percents. Note, since you mentioned each row is a contact, you can create a simple element to identify the number of contacts.

All together (and thanks to @DarrenTsai for the improved approach):

contacts <- length(df$treatment)

separate_longer_delim(df, treatment, ",") %>%
  count(treatment) %>%
  mutate(percent = n / contacts)

Output

  treatment n percent
1         1 4   0.500
2         2 4   0.500
3         3 1   0.125
4         5 3   0.375

Note, to get the same results with older versions of tidyr (<v1.3):

xx <- max(nchar(gsub(",", "", df$treatment)))
contacts <- length(df$treatment)

df %>%
  separate(treatment, into = paste0("col", seq_len(xx)), sep = ",") %>%
  pivot_longer(everything(), values_drop_na = TRUE, names_to = NULL, values_to = "treatment") %>%
  count(treatment) %>%
  mutate(percent = n / contacts)

#  treatment     n percent
#  <chr>     <int>   <dbl>
#1 1             4   0.5  
#2 2             4   0.5  
#3 3             1   0.125
#4 5             3   0.375

Upvotes: 1

Related Questions