Bezulba

Reputation: 21

"Variables actually used in tree construction" shows character(0) when using rpart() in R

I'm trying to build a model that determines whether a review is positive or negative. I've loaded all my data and tokenized it into a data frame whose first column is a factor indicating whether the review recommends the product or not.

> str(reviewtokensdf)
'data.frame':   500 obs. of  270 variables:
 $ recommend       : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 2 2 2 ...
 $ made            : num  3 0 0 0 0 0 1 0 0 0 ...
 $ site            : num  1 1 0 0 0 0 0 0 0 0 ...
 $ use             : num  1 0 0 0 1 0 0 0 0 0 ...
 $ one             : num  2 1 0 0 0 0 0 0 0 0 ...
 $ will            : num  1 1 1 0 0 0 0 0 0 0 ...
 $ make            : num  2 1 0 0 1 0 0 0 0 1 ...
 $ book            : num  6 0 0 0 3 0 0 0 0 0 ...
 $ place           : num  3 0 0 0 0 1 0 0 0 0 ...
 $ stay            : num  1 0 0 0 0 0 0 0 0 0 ...
 $ night           : num  1 0 0 2 0 0 0 0 0 1 ...
 $ arriv           : num  1 0 0 0 1 0 0 0 0 0 ...
 $ small           : num  1 0 0 0 0 0 0 0 0 0 ...
 $ floor           : num  1 0 0 3 0 0 1 0 0 0 ...

I've been using a smaller subset (n = 500) just for testing purposes, but that shouldn't be a problem. I've been following this tutorial (https://medium.com/analytics-vidhya/customer-review-analytics-using-text-mining-cd1e17d6ee4e) closely for guidance, but I keep running into this problem:

When I run this code:

library(rpart)
tree <- rpart(formula = recommend ~ ., data = reviewtokensdf, method = "class",
              control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))
printcp(tree)

I expect to see at least some words in the "Variables actually used in tree construction:" section, but it stays at character(0) and I have no idea why.

    Classification tree:
    rpart(formula = recommend ~ ., data = reviewtokensdf, method = "class", 
        control = rpart.control(minsplit = 200, minbucket = 30, cp = 1e-04))

    Variables actually used in tree construction:
    character(0)

    Root node error: 40/500 = 0.08

    n= 500 

      CP nsplit rel error xerror xstd
    1  0      0         1      0    0

I tried stripping the rpart() call down to the basics (dropping rpart.control and so on), but no dice. I also tried things like reviewtokensdf$recommended in the formula, with the same result.

When I run the example data from the tutorial I mentioned, everything works fine, yet I can't see what is different about my data.

Upvotes: 1

Views: 442

Answers (1)

G5W

Reputation: 37641

The problem is with your rpart.control settings. They may have been well tuned for the full data set with thousands of documents, but with only 500 observations they are bad choices. Try

rpart.control(minsplit = 20, minbucket = 5, cp = 0.01)

and you will probably get some nodes split. I am NOT saying that these are good choices, but they would be a better starting place. See what happens and adjust.
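To see the effect of these settings in isolation, here is a minimal sketch that fits the same kind of tree twice, once with the strict controls from the question and once with the looser ones suggested above. It assumes the kyphosis data set bundled with rpart as a stand-in for reviewtokensdf, so the numbers will not match your data. Note also that with only 81 rows the strict fit fails simply because the root is smaller than minsplit, whereas in your case (500 rows, 40 in the minority class) minbucket = 30 is the more likely constraint; that part is my guess, not something the output proves.

```r
library(rpart)  # also provides the kyphosis example data (81 rows)

# Strict controls copied from the question: too demanding for a small sample.
strict <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))

# Looser controls as a starting point; not tuned values.
relaxed <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                 control = rpart.control(minsplit = 20, minbucket = 5, cp = 0.01))

# Each row of $frame is one node, so a single row means the tree is only a root.
nrow(strict$frame)
nrow(relaxed$frame)

# Variables used for splits; "<leaf>" marks terminal nodes and is dropped.
setdiff(unique(as.character(relaxed$frame$var)), "<leaf>")
```

On your own data, table(reviewtokensdf$recommend) is worth a look as well: if one class is rare, a large minbucket leaves very few splits that can change the predicted class in either child.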

Upvotes: 1
