Marc van der Peet

Reputation: 343

Classify similar words

I'm currently working on a text mining process in which I would like to map similar words (table, tables, etc.) to a single word (table). I saw that the tm package offers a tool for this, but it does not support the language I'm working with, so I want to build something myself.

For the function I want to use a link table:

 a <- c("Table", "Tables", "Tree", "Trees")
 b <- c("Table", "Tree", "Chair", "Invoice")
 df <- data.frame(b, a)

So that I can automatically convert all the "Tables" values into "Table".

Any thoughts on how I can do this?

Upvotes: 0

Views: 696

Answers (1)

Victorp

Reputation: 13856

Search for "stemming" in R. You can try:

a <- c("Table", "Tables", "Tree", "Trees")
b <- c("Table", "Tree", "Chair", "Invoice")
library("SnowballC")
wordStem(words = a, language = "porter")
##[1] "Tabl" "Tabl" "Tree" "Tree"
library("tm") # tm uses wordStem from SnowballC
stemCompletion(x = stemDocument(x = a), dictionary = b)
##   Tabl    Tabl    Tree    Tree 
##"Table" "Table"  "Tree"  "Tree" 
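
Since the question mentions a language that tm does not support, it may help to check what SnowballC itself offers before writing a custom solution. A minimal sketch, assuming SnowballC is installed; `my_language` is a placeholder, not the asker's actual language:

```r
library("SnowballC")

# List the stemming languages SnowballC knows about
supported <- getStemLanguages()
print(supported)

# Placeholder value: replace with the language you actually need
my_language <- "english"

if (my_language %in% supported) {
  # Stem a few sample words in that language
  print(wordStem(c("tables", "trees"), language = my_language))
} else {
  message("No Snowball stemmer for '", my_language, "'; a custom lookup is needed.")
}
```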

Alternatively, for something more complicated to use but more complete, you can look at the koRpus package and use TreeTagger to process your text:

library("koRpus")
tagged.results <- treetag(tolower(a), treetagger="manual", format="obj",
                          TT.tknz=FALSE , lang="en",
                          TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res
##   token tag lemma lttr wclass                   desc stop stem
##1  table  NN table    5   noun Noun, singular or mass   NA   NA
##2 tables NNS table    6   noun           Noun, plural   NA   NA
##3   tree  NN  tree    4   noun Noun, singular or mass   NA   NA
##4  trees NNS  tree    5   noun           Noun, plural   NA   NA

What you want is in the lemma column:

tagged.results@TT.res$lemma
##[1] "table" "table" "tree"  "tree" 
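
If neither approach covers your language, the link table idea from the question can be applied directly in base R. The following is only a sketch under assumptions: the `link` table, its `from`/`to` column names, and the `normalize_words` helper are hypothetical and not part of tm, SnowballC, or koRpus.

```r
# Hypothetical link table: each inflected form ("from") maps to its base form ("to")
link <- data.frame(
  from = c("Tables", "Trees"),
  to   = c("Table", "Tree"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the inflected forms, values the base forms
lookup <- setNames(link$to, link$from)

# Hypothetical helper: replace words found in the lookup, keep all others unchanged
normalize_words <- function(words, lookup) {
  mapped <- unname(lookup[words])          # NA where a word has no entry
  ifelse(is.na(mapped), words, mapped)
}

a <- c("Table", "Tables", "Tree", "Trees")
normalized <- normalize_words(a, lookup)
print(normalized)
```

Words without an entry in the table pass through untouched, so the table only needs to list the forms you want to collapse. The lookup is case-sensitive; applying `tolower()` to both the table and the input first would make it case-insensitive.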

Upvotes: 5
