How to count values separated by commas in a dataframe

Question

I have a dataframe which was originally a survey where people could answer their dog breeds and their favorit dog breed. Obviously each person can have multiple answers which will be displayed in the same categorie but separated with a comma. However I can't figure out how to count the number of times each breed was mensionned for their respective category.

here is the code:

dogs_owned <-c("labrador, golden", "golden","pitbull, chihuahua")
dogs_fav <- c("beagle", "labrador, shepherd", "chihuahua, pitbull")
test <- data.frame(dogs_owned,dogs_fav)

list <- c("labrador", "golden","pitbull","chihuahua","beagle","shepherd")
list_test <- data.frame(list)
list_test$count_own <- 0
list_test$count_fav <- 0

The goal is to count how many times the name of each breed of dog appear in either dogs_owned and dogs_fav in their respective list count

r2evans · Accepted Answer

Try this:

out <- Reduce(function(a, b) merge(a, b, by="Var1", all=TRUE),
  lapply(test,
    function(z) as.data.frame.table(table(unlist(strsplit(z, "[ ,]+"))), 
                                    stringsAsFactors=FALSE)))
names(out)[-1] <- names(test)
out
#        Var1 dogs_owned dogs_fav
# 1    beagle         NA        1
# 2 chihuahua          1        1
# 3    golden          2       NA
# 4  labrador          1        1
# 5   pitbull          1        1
# 6  shepherd         NA        1

Brief walk-through:

strsplit(z, "[ ,]+") splits a string by consecutive commas and spaces (this is a trick to remove the need for trimws, which also works well); this results in
```
strsplit(test$dogs_fav, "[ ,]+")
# [[1]]
# [1] "beagle"
# [[2]]
# [1] "labrador" "shepherd"
# [[3]]
# [1] "chihuahua" "pitbull"  
```

table(unlist(.)) converts that to a vector and counts them

table(unlist(strsplit(test$dogs_fav, "[ ,]+")))
#    beagle chihuahua  labrador   pitbull  shepherd 
#         1         1         1         1         1

as.data.frame.table(.) converts that to a 2-column frame with names Var1 and Freq
merge(a, b, by="Var1", ...) combines the 2-column frames by the Var1 column (dog type), keeping each Freq column unique. Using all=TRUE ensures dogs found in one and not another are preserved
Reduce allows us to do this for a dynamic number of columns, so if we had more dogs_* columns, this same code works unchanged

How to count values separated by commas in a dataframe

Answers (2)

Related Questions