Reputation: 307
I would like to count the number of English words in a string of text.
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")))
df.words
ID text
1 1 frog friend fresh frink foot
2 2 get give gint gobble
I'd like the final product to look like this:
ID text count
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?
Upvotes: 0
Views: 188
Reputation: 160687
Base R alternative, using EJJ's great recommendation for dict
:
sapply(strsplit(df.words$text, "\\s+"),
function(z) sum(z %in% dict$text2))
# [1] 4 3
I thought that this would be a clear winner in speed, but apparently doing sum(. %in% .)
one at a time can be a little expensive. (It is slower with this data.)
Faster but not necessarily simpler:
words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
It's a hair faster (~ 10-15%) than EJJ's solution, so likely only a good thing if you need to wring some performance out of it.
(Caveat: EJJ's is faster with this 2-row dataset. If the data is 1000x larger, then my first solution is a little faster, and my second solution is twice as fast. Benchmarks are benchmarks, though, don't optimize code beyond usability if speed/time is not a critical factor.)
Upvotes: 1
Reputation: 1513
Building on @r2evans suggestion of using strsplit()
and using a random English word .txt file dictionary online, example is below. This solution probably might not scale well if you have a large number of comparisons because of the unnest
step.
library(dplyr)
library(tidyr)
# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")),
stringsAsFactors = FALSE)
df.words %>%
mutate(text2 = strsplit(text, split = "\\s")) %>%
unnest(text2) %>%
semi_join(dict, by = c("text2")) %>%
group_by(ID, text) %>%
summarise(count = length(text2))
Output
ID text count
<int> <chr> <int>
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
Upvotes: 1