Reputation: 5719
I am trying to do some text analysis and was wondering whether there is a tool or package that recognizes different forms of English words (e.g. singular, plural, past, present) and returns the word counts.
In this string vector, myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"), I want to get the count for the word Fire = 4 and the word Hospital = 5.
Upvotes: 1
Views: 260
Reputation: 5398
A stemming example using the quanteda library (https://quanteda.io/):
install.packages("quanteda")
library(quanteda)
mytext <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")
mytoks <- tokens(mytext)
toks_stem <- tokens_wordstem(mytoks, "english")
# tokens from 9 documents.
#[1] "fire", "fire", "fire", "fire", "hospit", "Hospit", "hospit", "hospit", "Hospit"
Quanteda Cheatsheet - https://github.com/rstudio/cheatsheets/blob/master/quanteda.pdf
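The stems above still need to be counted, and "hospit" and "Hospit" are treated as different tokens. A minimal sketch of one way to get the counts, assuming a reasonably recent quanteda where dfm() accepts a tokens object (the names toks and stem_counts are only illustrative):

```r
library(quanteda)

myvec <- c("fired", "fires", "firing", "fired", "hospitals",
           "Hospitals", "hospital", "hospitalization", "Hospitalized")

# Tokenize, lowercase so "Hospitals" and "hospitals" share a stem, then stem
toks <- tokens(myvec)
toks <- tokens_tolower(toks)
toks <- tokens_wordstem(toks, language = "english")

# Build a document-feature matrix and read off the per-stem totals
stem_counts <- topfeatures(dfm(toks))
stem_counts
```

With the vector from the question this should report 5 for the stem "hospit" and 4 for "fire". Note that the names are stems, not dictionary words, so "hospit" has to be mapped back to "Hospital" by hand if a readable label is needed.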
Upvotes: 0
Reputation: 5398
Check out the stemming technique.
Stemming is the process of reducing inflected (or sometimes derived) words to their root form. For example, "close" is the root of "closed", "closing", "close", "closer", etc.
install.packages(c("tm", "SnowballC"))  # stemDocument() needs SnowballC installed
library(tm)
mydf <- data.frame(doc_id = seq_len(9),
                   text = c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"),
                   stringsAsFactors = FALSE)
mycorpus <- SimpleCorpus(DataframeSource(mydf))
mytmmap <- tm_map(mycorpus, stemDocument, language = "english")
inspect(mycorpus)
inspect(mytmmap)
# <<SimpleCorpus>>
# Metadata: corpus specific: 1, document level (indexed): 0
# Content: documents: 9
#
# 1 2 3 4 5 6 7 8 9
# fire fire fire fire hospit Hospit hospit hospit Hospit
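To go from these stems to the counts asked for in the question, the words can be lowercased before stemming and the stems tabulated. A minimal sketch, assuming it is acceptable to call SnowballC::wordStem() directly (the stemmer that, as far as I know, tm's stemDocument() relies on); the labels lookup vector is a hypothetical helper written by hand for this example:

```r
library(SnowballC)

myvec <- c("fired", "fires", "firing", "fired", "hospitals",
           "Hospitals", "hospital", "hospitalization", "Hospitalized")

# Lowercase first so "Hospit" and "hospit" collapse into one stem
stems <- wordStem(tolower(myvec), language = "english")

# Hand-written lookup from stem to a readable label (an assumption,
# not something the stemmer provides)
labels <- c(fire = "Fire", hospit = "Hospital")

counts <- table(unname(labels[stems]))
counts
```

For this vector the table should show Fire = 4 and Hospital = 5. Any stem missing from labels would become NA and be dropped by table(), so the lookup has to cover every stem of interest.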
Upvotes: 3
Reputation:
A better option would be stringdist, but this would work:
f1 <- function(patVec, vec, nameVec) {
out <- colSums(sapply(patVec, agrepl, x = vec,
max.distance = 0.1, ignore.case = TRUE))
names(out) <- nameVec
out
}
o1 <- f1(c("fire", "hospital"), myvec, c("Fire", "Hospital"))
o1
# Fire Hospital
# 4 3
For the second vector
o1 <- f1(c("fire", "hospital"), myvec2, c("Fire", "Hospital"))
o1
# Fire Hospital
# 4 5
Or use soundex
library(phonics)
o2 <- table(substr(soundex(myvec), 1, 2))
names(o2) <- c("Fire", "Hospital")
o2
# Fire Hospital
# 4 3
For the second vector
o2 <- table(substr(soundex(myvec2), 1, 2))
names(o2) <- c("Fire", "Hospital")
o2
# Fire Hospital
# 4 5
All of these methods give the output expected in the OP's post. The data used:
myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital")
myvec2 <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")
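The stringdist option mentioned at the top of this answer could look roughly like the following. This is only a sketch under my own assumptions: Jaro-Winkler distance (method = "jw") with prefix weight p = 0.1, and a rule that assigns each word to whichever target is nearest. The names targets, d, and nearest are illustrative.

```r
library(stringdist)

myvec2 <- c("fired", "fires", "firing", "fired", "hospitals",
            "Hospitals", "hospital", "hospitalization", "Hospitalized")

# Target words, named with the labels wanted in the output
targets <- c(Fire = "fire", Hospital = "hospital")

# Distance between every word (rows) and every target (columns)
d <- stringdistmatrix(tolower(myvec2), targets, method = "jw", p = 0.1)

# Assign each word to its nearest target, then count per target
nearest <- names(targets)[apply(d, 1, which.min)]
counts <- table(factor(nearest, levels = names(targets)))
counts
```

For myvec2 this should give Fire = 4 and Hospital = 5. Because every word is forced onto its nearest target, unrelated words would also be counted, so real text would need a distance cutoff before the assignment step.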
Upvotes: 0