Reputation: 1441
How do I find the count of occurrences of a list of words? I can search for one word as follows:
x <- dplyr::filter(data, grepl("apple", content, ignore.case = TRUE))
length(x$content)
The | separator lets me match several words in one pattern, but that only gives a combined count of all occurrences. I want to count each word individually.
The words could be supplied as a row in a csv or written as a vector in R itself, e.g.:
words <- c("apple","orange","pear","pineapple")
One wrinkle is that data$content is a column of tweets, so a word can occur more than once per tweet. I'd like each tweet to count at most once per word, i.e. only whether the word occurs in the row.
Upvotes: 0
Views: 106
Reputation: 18425
Using the stringr package...
library(stringr)
words <- c("apple","orange","pear","pineapple")
data <- c("On my grocery list are green apples, red apples and oranges",
          "Oranges are my favourite, but I also like pineapples and pearls")
sapply(words, function(w)
  str_count(str_to_lower(data),           # lower-case each tweet so matching ignores case
            paste0("\\b", w, "s?\\b")))   # word boundaries plus an optional plural -s
     apple orange pear pineapple
[1,]     2      1    0         0
[2,]     0      1    0         1
This allows for capital letters, and should only count whole words (perhaps with an -s plural).
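The matrix above holds occurrences per tweet, whereas the question asks for each tweet to count at most once per word. One way to get both totals is sketched below in base R, so that it runs without extra packages. The helper name count_word and the variables occurrences and tweets are made up for this illustration, and the pattern assumes the same simple plural rule (an optional trailing -s) as above.

```r
words <- c("apple", "orange", "pear", "pineapple")
data <- c("On my grocery list are green apples, red apples and oranges",
          "Oranges are my favourite, but I also like pineapples and pearls")

# Hypothetical helper: occurrences of one word (optional plural -s) in each tweet
count_word <- function(w, text) {
  pattern <- paste0("\\b", w, "s?\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(text, hits))  # a tweet with no match contributes 0
}

# One row per tweet, one column per word
counts <- sapply(words, count_word, text = data)

occurrences <- colSums(counts)      # every occurrence counts
tweets <- colSums(counts > 0)       # each tweet counts at most once per word
```

Comparing counts > 0 turns the count matrix into a presence/absence matrix, so summing its columns gives the number of tweets that mention each word. Note that sapply only returns a matrix when data has more than one element; with a single tweet it returns a plain vector and colSums would fail.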
Upvotes: 0
Reputation: 10671
You could get logical values for the presence/absence of your target words like this:
library(tidyverse)
words <- c("apple","orange","pear","pineapple")
data <- tibble(content = c("On my grocery list are green apples, red apples and oranges",
                           "My favorite froyo flavors are pineapple, peach-pear and pear"))
boundary_words <- paste0("\\b", words) # leading word boundary, so the apple in pineapple is not counted
boundary_words %>%
  set_names(words) %>%                                  # name each pattern after its word
  map(~ grepl(.x, data$content, ignore.case = TRUE)) %>% # one logical vector per word
  as_tibble() %>%
  bind_cols(data, .)
# A tibble: 2 x 5
  content                                                      apple orange pear  pineapple
  <chr>                                                        <lgl> <lgl>  <lgl> <lgl>
1 On my grocery list are green apples, red apples and oranges  TRUE  TRUE   FALSE FALSE
2 My favorite froyo flavors are pineapple, peach-pear and pear FALSE FALSE  TRUE  TRUE
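To go from these logical columns to a per-word total, the columns can be summed, since TRUE counts as 1. The following is a rough sketch rather than part of the original answer: it assumes dplyr 1.0 or later (for across), rebuilds the flag columns with sapply so that it is self-contained, and the names flags, result and totals are only illustrative.

```r
library(dplyr)

words <- c("apple", "orange", "pear", "pineapple")
data <- tibble(content = c("On my grocery list are green apples, red apples and oranges",
                           "My favorite froyo flavors are pineapple, peach-pear and pear"))

# One logical column per word: TRUE when the tweet contains the word at least once
flags <- sapply(words, function(w) grepl(paste0("\\b", w), data$content, ignore.case = TRUE))
result <- bind_cols(data, as_tibble(flags))

# Summing a logical column counts its TRUE values, i.e. the tweets mentioning that word
totals <- summarise(result, across(all_of(words), sum))
```

Because each row contributes at most one TRUE per word, repeated mentions within a single tweet are not double counted.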
Upvotes: 1