Reputation: 540

How to complete a stemmed corpus from a dictionary using stemCompletion function (tm package)

I am having a trouble in the tm package of R. I am using 0.6.2 version. Following question (2 different errors) has already been answered here and here but still producing an error after using the posted solution. Please click here to download the dataset (93 rows only). It's a reproducible example. the two errors are below:

(Resolved) Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"
Error: inherits(doc, "TextDocument") is not TRUE
tm_map(ds.corpus, PlainTextDocument) does not create a plain text document in this case. inherits(ds.cleanCorpus, "TextDocument") # returns FALSE

please tell me what is wrong in my approach.

  # Data import
    df.imp<- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

   ##### Data Pre-Processing 

        install.packages("tm")
    require(tm)  

    ds.corpus<- Corpus(VectorSource(df.imp$Content))

    ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
    ds.corpus<- tm_map(ds.corpus,removeURL)

    stopwords.default<- stopwords("english")
    stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                            "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                            "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                            "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

    stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
    ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

    copy<- ds.corpus ## creating a copy to be used as a dictionary

    ds.corpus<- tm_map(ds.corpus, stemDocument)

    ## error Statement #1
    ds.corpus<-  stemCompletion(ds.corpus, dictionary = copy) 
    ## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"




    ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document

    class(ds.cleanCorpus) ## output is VCorpus" "Corpus".  what it should be??

    ## error Statement #2
    tdm<- TermDocumentMatrix(ds.corpus) ## creating  term document matrix 

    inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE

Update: Figured out first error, that the stemCompletion method's x parameter should be a character vector and dictionary could be either a corpus or character vector. However, when I tried it on first document (character vector) of ds.corpus, as below, stemmed words were not completed and output is just the stemmed character vector like before.

stemCompletion(ds.corpus[[1]]$content, dictionary = copy)

So now my main question is "How to complete a stemmed corpus from a dictionary (tm package)?" The stemCompletion method doesn't seems working (on a character vector). Secondly, how can I complete the stemming of an entire corpus, should I use a for loop for each document of the corpus's content?

Upvotes: 3

Answers (2)

firefreezing

Reputation: 33

not sure if you have found the solution already. I have been informed by this post stemCompletion is not working and I believe it solves your second questions of "How to complete a stemmed corpus from a dictionary (tm package)?" (as well as mine, which is similar to yours). Specifically, you can try the following code:

    stem_completion <- tm_map(ds.corpus, 
                       content_transformer(function(x, d)
                         paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), 
                               collapse = ' ')), d = copy)

Upvotes: 1

amitkb3

Reputation: 313

There are 2 things you need to change

When you use a custom function you need to use content_transformer

removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)

ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))
The purpose of the function stemCompletion is to try to complete a stemmed word https://en.wikipedia.org/wiki/Stemming based on a dictionary. The stemmed words need to be a character vector and dictionary can be a corpus.

x <- c("compan", "entit", "suppl") stemCompletion(x, copy)

output:

 compan       entit       suppl

"companies" "" "supplies"

Code to create Document Term Matrix

# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing 

#install.packages("tm")
require(tm)  

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))


stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                        "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                        "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                        "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

tdm<- TermDocumentMatrix(ds.corpus)

Example to complete stemmed words

copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)

Upvotes: 3

How to complete a stemmed corpus from a dictionary using stemCompletion function (tm package)

Answers (2)

Example to complete stemmed words

Related Questions