Dinal24

Reputation: 3192

An error occurs when calling rpart for a large data set

I have a large data set which has 100k data fields. Calling str() or viewing the full data works without any glitch, but when I run rpart on the training set it takes some time, and after about 3-4 minutes it shows the following error:

Error: Unable to establish connection with R session

My script looks like below:

# Decision tree
library(rpart)
library(rattle)
library(party)

train_set <- read.table('my_sample_trainset.csv', header=TRUE, sep=',', stringsAsFactors=FALSE)
test_set <- read.table('my_sample_testset.csv', header=TRUE, sep=',', stringsAsFactors=FALSE)

my_trained_tree <- rpart(Route ~ Bus_Id + week_days + time_slot, data=train_set, method="class")
# Error occurs on/after this line

my_prediction <- predict(my_trained_tree, test_set, type = "class")

my_solution <- data.frame(Route = my_prediction)

write.csv(my_solution, file = "solution.csv", row.names = FALSE)

Am I missing a library, or does this happen because of the big data set (6.5 MB)?

Also, I am using RStudio version 0.99.447 on Mac OS X Yosemite.

Upvotes: 2

Views: 1409

Answers (1)

Chris Kennedy

Reputation: 349

That message means that R is still calculating the results. If you open Activity Monitor and sort by CPU usage on the CPU tab, you should see that rsession is using 100% of a CPU. So you can just click "ok" on that message and allow R to keep computing.

I wish there were a workaround, though; this issue is plaguing me as we speak!
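One way to check that R is simply still computing, rather than stuck, is to time the same rpart call on growing subsamples and see how the run time scales. The following is only a rough sketch: the data frame is synthetic (a hypothetical stand-in for train_set that reuses the column names from the question), and the subsample sizes are arbitrary assumptions.

```r
library(rpart)

set.seed(1)

# Hypothetical stand-in for the question's train_set (synthetic data,
# only the column names are taken from the question).
n <- 2000
train_set <- data.frame(
  Route     = factor(sample(paste0("R", 1:5), n, replace = TRUE)),
  Bus_Id    = sample(1:50, n, replace = TRUE),
  week_days = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri"), n, replace = TRUE)),
  time_slot = sample(0:23, n, replace = TRUE)
)

# Fit the same model on subsamples of increasing size and record
# the elapsed wall-clock time of each fit.
sizes <- c(500, 1000, 2000)  # arbitrary sizes chosen for illustration
timings <- sapply(sizes, function(k) {
  idx <- sample(nrow(train_set), k)
  system.time(
    rpart(Route ~ Bus_Id + week_days + time_slot,
          data = train_set[idx, ], method = "class")
  )[["elapsed"]]
})

print(data.frame(rows = sizes, seconds = timings))
```

If the timings on the real data grow steeply with the number of rows, a fit on the full set that runs for several minutes is plausible, and dismissing the dialog and waiting is a reasonable response.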

Upvotes: 1
