Reputation: 21
I'm trying to build a model to determine whether a review is positive or negative. I've loaded all my data and tokenized it into a data frame whose first column is a factor indicating whether it's recommended or not.
> str(reviewtokensdf)
'data.frame': 500 obs. of 270 variables:
$ recommend : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 2 2 2 ...
$ made : num 3 0 0 0 0 0 1 0 0 0 ...
$ site : num 1 1 0 0 0 0 0 0 0 0 ...
$ use : num 1 0 0 0 1 0 0 0 0 0 ...
$ one : num 2 1 0 0 0 0 0 0 0 0 ...
$ will : num 1 1 1 0 0 0 0 0 0 0 ...
$ make : num 2 1 0 0 1 0 0 0 0 1 ...
$ book : num 6 0 0 0 3 0 0 0 0 0 ...
$ place : num 3 0 0 0 0 1 0 0 0 0 ...
$ stay : num 1 0 0 0 0 0 0 0 0 0 ...
$ night : num 1 0 0 2 0 0 0 0 0 1 ...
$ arriv : num 1 0 0 0 1 0 0 0 0 0 ...
$ small : num 1 0 0 0 0 0 0 0 0 0 ...
$ floor : num 1 0 0 3 0 0 1 0 0 0 ...
I've been using a smaller subset (n = 500) just for testing purposes, but that shouldn't be a problem. I've been following this tutorial (https://medium.com/analytics-vidhya/customer-review-analytics-using-text-mining-cd1e17d6ee4e) closely for guidance, but I keep running into this problem:
When I run this code:
tree = rpart(formula = recommend ~ ., data = reviewtokensdf, method="class",control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))
printcp(tree)
I expect to see at least some words in the "Variables actually used in tree construction:" section, but it keeps coming back as character(0) and I have no clue why.
Classification tree:
rpart(formula = recommend ~ ., data = reviewtokensdf, method = "class",
control = rpart.control(minsplit = 200, minbucket = 30, cp = 1e-04))
Variables actually used in tree construction:
character(0)
Root node error: 40/500 = 0.08
n= 500
  CP nsplit rel error xerror xstd
1  0      0         1      0    0
I tried stripping the rpart arguments down to the basics (dropping rpart.control etc.), no dice. I also tried things like reviewtokensdf$recommended in the formula, same result.
When I run the example data from the tutorial I mentioned, it's all fine and dandy, yet I can't see a difference.
Upvotes: 1
Views: 442
Reputation: 37641
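The problem is with your rpart.control settings. They may have been well adjusted for the full data set with thousands of documents, but with only 500, these are bad choices: minsplit = 200 means a node needs at least 200 reviews before a split is even attempted, and minbucket = 30 means every leaf must keep at least 30. Try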
rpart.control(minsplit = 20, minbucket = 5, cp = 0.01)
and you will probably get some nodes split. I am NOT saying that these are good choices, but they would be a better starting place. See what happens and adjust.
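To see the effect in isolation, here is a minimal sketch on made-up data (not your reviewtokensdf). The data frame `toy`, its columns `bad` and `stay`, and the helper `used_vars` are hypothetical names invented for this illustration. The signal word occurs in only 25 of 500 rows, so a split on it cannot leave 30 rows on both sides.

```r
library(rpart)

# Hypothetical toy data: 500 "reviews", 20 of them negative (class "0").
# The word "bad" occurs only in the first 25 rows; "stay" is unrelated noise.
n <- 500
toy <- data.frame(
  recommend = factor(c(rep(0, 20), rep(1, n - 20)), levels = c(0, 1)),
  bad       = c(rep(1, 25), rep(0, n - 25)),
  stay      = rep(c(0, 1), n / 2)
)

# Controls from the question: a split on "bad" would leave a child with
# only 25 rows, which is below minbucket = 30, so it is not allowed.
strict <- rpart(recommend ~ ., data = toy, method = "class",
                control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))

# Looser controls suggested above: the 25-row child is now permitted.
loose <- rpart(recommend ~ ., data = toy, method = "class",
               control = rpart.control(minsplit = 20, minbucket = 5, cp = 0.01))

# Helper (hypothetical): names of the variables a fitted tree actually splits on.
used_vars <- function(fit) setdiff(unique(as.character(fit$frame$var)), "<leaf>")

print(used_vars(strict))
print(used_vars(loose))
```

On your own data, printcp(tree) reports the same thing in its "Variables actually used in tree construction" section, so you can rerun it after each change to the controls and watch when words start to appear.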
Upvotes: 1