Reputation: 45
I have a dataframe which was originally a survey where people could answer their dog breeds and their favorit dog breed. Obviously each person can have multiple answers which will be displayed in the same categorie but separated with a comma. However I can't figure out how to count the number of times each breed was mensionned for their respective category.
here is the code:
dogs_owned <-c("labrador, golden", "golden","pitbull, chihuahua")
dogs_fav <- c("beagle", "labrador, shepherd", "chihuahua, pitbull")
test <- data.frame(dogs_owned,dogs_fav)
list <- c("labrador", "golden","pitbull","chihuahua","beagle","shepherd")
list_test <- data.frame(list)
list_test$count_own <- 0
list_test$count_fav <- 0
The goal is to count how many times the name of each breed of dog appear in either dogs_owned and dogs_fav in their respective list count
Upvotes: 2
Views: 36
Reputation: 61154
You can use strsplit
to split strings and then count them using table
> (owned <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_owned, split=","))))))
Var1 Freq
1 chihuahua 1
2 golden 2
3 labrador 1
4 pitbull 1
> (fav <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_fav, split=","))))))
Var1 Freq
1 beagle 1
2 chihuahua 1
3 labrador 1
4 pitbull 1
5 shepherd 1
If you wish them in one data.frame, we can use full_join
from dplyr package as alternative to merge
(see r2evans' answer)
> library(dplyr)
owned %>%
full_join(fav, by="Var1") %>%
rename(Owned = Freq.x,
Fav = Freq.y)
Var1 Owned Fav
1 chihuahua 1 1
2 golden 2 NA
3 labrador 1 1
4 pitbull 1 1
5 beagle NA 1
6 shepherd NA 1
Upvotes: 2
Reputation: 160447
Try this:
out <- Reduce(function(a, b) merge(a, b, by="Var1", all=TRUE),
lapply(test,
function(z) as.data.frame.table(table(unlist(strsplit(z, "[ ,]+"))),
stringsAsFactors=FALSE)))
names(out)[-1] <- names(test)
out
# Var1 dogs_owned dogs_fav
# 1 beagle NA 1
# 2 chihuahua 1 1
# 3 golden 2 NA
# 4 labrador 1 1
# 5 pitbull 1 1
# 6 shepherd NA 1
Brief walk-through:
strsplit(z, "[ ,]+")
splits a string by consecutive commas and spaces (this is a trick to remove the need for trimws
, which also works well); this results in
strsplit(test$dogs_fav, "[ ,]+")
# [[1]]
# [1] "beagle"
# [[2]]
# [1] "labrador" "shepherd"
# [[3]]
# [1] "chihuahua" "pitbull"
table(unlist(.))
converts that to a vector and counts them
table(unlist(strsplit(test$dogs_fav, "[ ,]+")))
# beagle chihuahua labrador pitbull shepherd
# 1 1 1 1 1
as.data.frame.table(.)
converts that to a 2-column frame with names Var1
and Freq
merge(a, b, by="Var1", ...)
combines the 2-column frames by the Var1
column (dog type), keeping each Freq
column unique. Using all=TRUE
ensures dogs found in one and not another are preserved
Reduce
allows us to do this for a dynamic number of columns, so if we had more dogs_*
columns, this same code works unchanged
Upvotes: 1