Rakesh

Reputation: 79

Decision tree in R is not forming with my training data

library(caret)
library(rpart.plot)
car_df <- read.csv("TrainingDataSet.csv", sep = ',', header = TRUE)
str(car_df)

set.seed(3033)
intrain <- createDataPartition(y = car_df$Result, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
dim(training)
dim(testing)
anyNA(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart",
               parms = list(split = "infromation"),
               trControl=trctrl,
               tuneLength = 10)

I get this warning:

Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.

I am trying to classify whether a movie is a hit or a flop using the percentages of positive and negative sentiment. Here is my data:

dput(car_df)

structure(list(MovieName = structure(c(20L, 5L, 31L, 26L, 27L, 
12L, 36L, 29L, 38L, 4L, 6L, 8L, 10L, 15L, 18L, 21L, 24L, 34L, 
35L, 7L, 37L, 25L, 23L, 2L, 11L, 40L, 33L, 28L, 14L, 3L, 17L, 
16L, 32L, 22L, 30L, 1L, 19L, 39L, 9L, 13L), .Label = c("#96Movie", 
"#alphamovie", "#APrivateWar", "#AStarIsBorn", "#BlackPanther", 
"#BohemianRhapsody", "#CCV", "#Creed2", "#CrimesOfGrindelwald", 
"#Deadpool2", "#firstman", "#GameNight", "#GreenBookMovie", "#grinchmovie", 
"#Incredibles2", "#indivisiblemovie", "#InstantFamily", "#JurassicWorld", 
"#KolamaavuKokila", "#Oceans8", "#Overlord", "#PariyerumPerumal", 
"#RalphBreaksTheInternet", "#Rampage", "#Ratchasan", "#ReadyPlayerOne", 
"#RedSparrow", "#RobinHoodMovie", "#Sarkar", "#Seemaraja", "#Skyscraper", 
"#Suspiria", "#TheLastKey", "#TheNun", "#ThugsOfHindostan", "#TombRaider", 
"#VadaChennai", "#Venom", "#Vishwaroopam2", "#WidowsMovie"), class = "factor"), 
    PositivePercent = c(40.10554, 67.65609, 80.46796, 71.34831, 
    45.36082, 68.82591, 46.78068, 63.85787, 47.20497, 32.11753, 
    63.7, 39.2, 82.76553, 88.78613, 72.18274, 72.43187, 31.0089, 
    38.50932, 38.9, 19.9, 84.26854, 29.4382, 58.13953, 86.9281, 
    64.54965, 56, 0, 56.61914, 58.82353, 54.98891, 78.21682, 
    90, 64.3002, 85.8, 51.625, 67.71894, 92.21557, 53.84615, 
    40.12158, 68.08081), NegativePercent = c(11.34565, 21.28966, 
    6.408952, 13.10861, 26.80412, 17.10526, 18.61167, 10.55838, 
    46.48033, 56.231, 9.9, 12.1, 9.018036, 6.473988, 13.90863, 
    16.77149, 63.20475, 42.54658, 40.9, 5.4, 3.907816, 2.022472, 
    10.51567, 3.267974, 15.12702, 15.3, 100, 18.12627, 11.76471, 
    13.41463, 5.775076, 10, 20.08114, 2.1, 5.5, 7.739308, 0, 
    34.61538, 12.86727, 10.70707), Result = structure(c(2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Flop", "Hit"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-40L))

Upvotes: 0

Views: 231

Answers (1)

IRTFM

Reputation: 263342

> str(car_df)
'data.frame':   40 obs. of  4 variables:
 $ MovieName      : Factor w/ 40 levels "#96Movie","#alphamovie",..: 20 5 31 26 27 12 36 29 38 4 ...
 $ PositivePercent: num  40.1 67.7 80.5 71.3 45.4 ...
 $ NegativePercent: num  11.35 21.29 6.41 13.11 26.8 ...
 $ Result         : Factor w/ 2 levels "Flop","Hit": 2 2 2 2 2 2 2 2 2 1 ...

> with(car_df, table(Result))
Result
Flop  Hit 
   5   35 

> dtree_fit
CART 

29 samples
 3 predictor
 2 classes: 'Flop', 'Hit' 

So you have an outcome with only 5 flops, and one of the predictors is a factor with 40 distinct values, one per row. The warning is not surprising: each of your cases is unique, and the outcome is severely unbalanced. The presence of data does not guarantee the possibility of substantial conclusions. If there's any error here, it's that the fitting code lacks a check that would say something along the lines of: "Really? You think statistical packages should be able to solve a severe lack of data?"
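As a sketch of one minimal adjustment, reusing the objects from your own code (this won't cure the resampling warning on its own, but it keeps the row-identifier factor out of the tree):

# Sketch: model only the two sentiment percentages; MovieName has one
# level per row, so it is an identifier, not a predictor
dtree_fit <- train(Result ~ PositivePercent + NegativePercent,
                   data = training, method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)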

BTW: it should be (though, unsurprisingly, fixing it doesn't clear the warning):

(split = "information")

If you change the number of cross-validation folds to a number small enough that the flops can be distributed among the various folds, then you can get a warning-free result. Whether it will have much validity remains questionable, given the small sample size:

trctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)
# no warning on one of my runs
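
You can see directly why 10 folds fail, with a quick sketch using caret's createFolds on the same training object: with only a handful of flops among 29 training rows, most of the 10 held-out folds contain no Flop cases at all, so the class-specific performance measures come back as NA.

# Sketch: count each class in every held-out fold; with roughly 4
# flops in 29 rows, most of the 10 folds hold zero flops
set.seed(3333)
folds <- createFolds(training$Result, k = 10)
sapply(folds, function(idx) table(training$Result[idx]))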

Upvotes: 1
