Achal Neupane

Reputation: 5719

English dictionary based word count in R

I am trying to do some text analysis, and I was wondering if there is any tool or package that recognizes different inflected forms of English words (e.g. singular, plural, past, present) and gets the word counts.

In this string vector myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"), I want to get the counts Fire = 4 and Hospital = 5.

Upvotes: 1

Views: 260

Answers (3)

M.Viking

Reputation: 5398

A stemming example using the Quanteda library. https://quanteda.io/

install.packages("quanteda")

library(quanteda)

mytext <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")

mytoks <- tokens(mytext)

toks_stem <- tokens_wordstem(mytoks, "english")
# tokens from 9 documents.
#[1] "fire",  "fire", "fire", "fire", "hospit", "Hospit", "hospit", "hospit", "Hospit"
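
To turn the stems into counts, one further step (a sketch, not part of the original answer) is to build a document-feature matrix; dfm() lowercases features by default, so "Hospit" and "hospit" are merged:

# Build a document-feature matrix from the stemmed tokens;
# dfm() lowercases features by default, folding "Hospit" into "hospit"
mydfm <- dfm(toks_stem)

# Sum each stem's count across all nine documents
colSums(mydfm)
# fire hospit 
#    4      5 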

Quanteda Cheatsheet - https://github.com/rstudio/cheatsheets/blob/master/quanteda.pdf

Upvotes: 0

M.Viking

Reputation: 5398

Check out the stemming technique.

Stemming - the process of reducing inflected (or sometimes derived) words to their root form (e.g. "close" is the root of "closed", "closing", "close", "closer", etc.).

install.packages("tm")
library(tm)

mydf <- data.frame(doc_id = 1:9, 
                   text = c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"), 
                   stringsAsFactors = FALSE)

mycorpus <- SimpleCorpus(DataframeSource(mydf))

mytmmap <- tm_map(mycorpus, stemDocument, language = "english")  

inspect(mycorpus)

inspect(mytmmap)

# <<SimpleCorpus>>
# Metadata:  corpus specific: 1, document level (indexed): 0
# Content:  documents: 9
#
#     1      2      3      4      5      6      7      8      9 
#  fire   fire   fire   fire hospit Hospit hospit hospit Hospit 
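
To tally the stems, one option (a sketch, not part of the original answer) is to call stemDocument() directly on the character vector, which tm also supports, and fold case before counting:

# stemDocument() also accepts a plain character vector;
# tolower() merges "Hospit" and "hospit" before counting
table(tolower(stemDocument(mydf$text, language = "english")))
#   fire hospit 
#      4      5 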

Upvotes: 3

user11937744

Reputation:

A better option would be stringdist, but this would work

f1 <- function(patVec, vec, nameVec) {
  # Approximate-match each pattern against the vector and count the hits
  out <- colSums(sapply(patVec, agrepl, x = vec,
                        max.distance = 0.1, ignore.case = TRUE))
  names(out) <- nameVec
  out
}
        
o1 <- f1(c("fire", "hospital"), myvec, c("Fire", "Hospital"))

o1
#    Fire Hospital 
#       4        3 

For the second vector

o1 <- f1(c("fire", "hospital"), myvec2, c("Fire", "Hospital"))
o1
#    Fire Hospital 
#      4        5 

Or use soundex

library(phonics)
o2 <- table(substr(soundex(myvec), 1, 2))
names(o2) <- c("Fire", "Hospital")
o2
#   Fire Hospital 
#      4        3 

For the second vector

o2 <- table(substr(soundex(myvec2), 1, 2))
names(o2) <- c("Fire", "Hospital")
o2
#    Fire Hospital 
#       4        5 

All the methods give the expected output shown in the OP's post.
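
The stringdist option mentioned at the top could look like this (a sketch, assuming the stringdist package; amatch() returns, for each input, the index of the closest dictionary entry within maxDist, here using the Jaro-Winkler distance so that long derived forms like "hospitalization" still match):

library(stringdist)

# Match each lower-cased word to its nearest dictionary entry;
# words farther than maxDist from both entries would get NA
idx <- amatch(tolower(myvec2), c("fire", "hospital"),
              method = "jw", maxDist = 0.3)
table(c("Fire", "Hospital")[idx])
#     Fire Hospital 
#        4        5 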

data

myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital")
myvec2 <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")

Upvotes: 0
