Reputation: 169
In a large dataset I have a column (treatment) which contains numbers, sometimes more then one, but is classed as character.
treatment <- c("1", "1", "2", "5", "1,2", "2,5", "1,2,5", "3")
df <- data.frame(treatment)
Each number represents a treatment. I want to count the total number and percentage of each treatment.
Desired output:
treatment number percent
1 4 50
2 4 50
3 1 12,5
5 3 37,5
I've tried to sum and to use group_by(treatment) %>% summarise(percent = 100 *n() / nrow(df)
But i'm having trouble with the rows with multiple numbers and it's class character.
Any tips?
Upvotes: 1
Views: 488
Reputation: 17215
You can use tidy::separate_longer_delim
to separate out the combined values by ","
- i.e.:
library(dplyr)
library(tidyr)
separate_longer_delim(df, treatment, ",")
# treatment
#1 1
#2 1
#3 2
#4 5
#5 1
#6 2
#7 2
#8 5
#9 1
#10 2
#11 5
#12 3
Then simply use dplyr::count()
to count and dplyr::mutate
to get the percents. Note, since you mentioned each row is a contact, you can create a simple element to identify the number of contacts.
All together (and thanks to @DarrenTsai for the improved approach):
contacts <- length(df$treatment)
separate_longer_delim(df, treatment, ",") %>%
count(treatment) %>%
mutate(percent = n / contacts)
Output
treatment n percent
1 1 4 0.500
2 2 4 0.500
3 3 1 0.125
4 5 3 0.375
Note, to get the same results with older versions of tidyr
(<v1.3):
xx <- max(nchar(gsub(",", "", df$treatment)))
contacts <- length(df$treatment)
df %>%
separate(treatment, into = paste0("col", seq_len(xx)), sep = ",") %>%
pivot_longer(everything(), values_drop_na = TRUE, names_to = NULL, values_to = "treatment") %>%
count(treatment) %>%
mutate(percent = n / contacts)
# treatment n percent
# <chr> <int> <dbl>
#1 1 4 0.5
#2 2 4 0.5
#3 3 1 0.125
#4 5 3 0.375
Upvotes: 1