Reputation: 343
I'm currently doing a text-mining process in which I would like to transform similar words (table, tables, etc.) into one word (table). I saw that the tm package offers a tool for this, but it does not support the language I'm working with, so I want to create something myself.
For the function I want to have a link table:
a <- c("Table", "Tables", "Tree", "Trees")
b <- c("Table", "Tree", "Chair", "Invoice")
df <- data.frame(b, a)
So that I can automatically convert all the "Tables" values into "Table".
Any thoughts on how I can do this?
Upvotes: 0
Views: 696
Reputation: 13856
Search for stemming in R. You can look here, and you can try:
a <- c("Table", "Tables", "Tree", "Trees")
b <- c("Table", "Tree", "Chair", "Invoice")
library("SnowballC")
wordStem(words = a, language = "porter")
##[1] "Tabl" "Tabl" "Tree" "Tree"
library("tm") # tm uses wordStem
stemCompletion(x = stemDocument(x = a), dictionary = b)
## Tabl Tabl Tree Tree
##"Table" "Table" "Tree" "Tree"
Or, more complicated to use but more complete, you can look at the koRpus package
and use TreeTagger to process your text:
library("koRpus")
tagged.results <- treetag(tolower(a), treetagger="manual", format="obj",
                          TT.tknz=FALSE, lang="en",
                          TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res
##   token tag lemma lttr wclass                   desc stop stem
##1  table  NN table    5   noun Noun, singular or mass   NA   NA
##2 tables NNS table    6   noun           Noun, plural   NA   NA
##3   tree  NN  tree    4   noun Noun, singular or mass   NA   NA
##4  trees NNS  tree    5   noun           Noun, plural   NA   NA
What you want is in:
tagged.results@TT.res$lemma
##[1] "table" "table" "tree" "tree"
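If neither tool covers your language, the hand-built link table from the question can also work on its own. Below is a minimal sketch in base R, not a definitive implementation. It assumes the table is laid out with one row per word form next to its base form (a different layout from the `df` in the question), and the names `link`, `lookup` and `normalize_words` are made up for illustration:

```r
# Hypothetical link table: one row per word form, paired with its base form.
link <- data.frame(
  form = c("Table", "Tables", "Tree", "Trees"),
  base = c("Table", "Table", "Tree", "Tree"),
  stringsAsFactors = FALSE
)

# Named character vector used as a lookup: names are forms, values are bases.
lookup <- setNames(link$base, link$form)

# Replace every word found in the lookup; leave unknown words unchanged.
normalize_words <- function(words, lookup) {
  mapped <- unname(lookup[words])
  # Words missing from the lookup come back as NA, so fall back to the input.
  ifelse(is.na(mapped), words, mapped)
}

normalize_words(c("Tables", "Trees", "Chair", "Table"), lookup)
##[1] "Table" "Tree"  "Chair" "Table"
```

The match is exact and case-sensitive, so you would probably want to run `tolower()` on both the table and the text first, and you have to fill the table yourself for your language.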
Upvotes: 5