Felix Zhao

Reputation: 489

Random Forest with caret package: Error: cannot allocate vector of size 153.1 Gb

I was trying to build a random forest model for a Kaggle dataset; I usually do my machine learning with the caret package. The dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size). More than 40 of the variables are categorical, and the outcome is the binary response I am trying to predict. After some pre-processing with dplyr, I started building the model with caret, but I got this error message when I tried to run the "train" function: "Error: cannot allocate vector of size 153.1 Gb". Here is my code:

## load packages
require(tidyr)
require(dplyr)
require(readr)
require(ggplot2)
require(ggthemes)
require(caret)
require(parallel)
require(doParallel)

## prepare for parallel processing
n_Cores <- detectCores()
n_Cluster <- makeCluster(n_Cores)
registerDoParallel(n_Cluster)

## import original datasets
people_Dt <- read_csv("people.csv",col_names = TRUE)
activity_Train <- read_csv("act_train.csv",col_names = TRUE)

### join two sets together and remove variables not to be used
first_Try <- people_Dt%>%
   left_join(activity_Train,by="people_id")%>%
   select(-ends_with("y"))%>%
   filter(!is.na(outcome))

## try with random forest
in_Tr <- createDataPartition(first_Try$outcome,p=0.75,list=FALSE)
rf_Train <- first_Try[in_Tr,]
rf_Test <- first_Try[-in_Tr,]
## set model cross validation parameters
model_Control <- trainControl(method = "repeatedcv",repeats=2,number=2,allowParallel = TRUE)
rf_RedHat <- train(outcome~.,
               data=rf_Train,
               method="rf",
               tuneLength=10,
               importance=TRUE,
               trControl=model_Control)

My computer is a fairly powerful machine with an E3 processor and 32 GB of RAM. I have two questions: 1. Where did I get a vector as large as 150 GB? Is it caused by some of the code I wrote? 2. I cannot get a machine with that much RAM, so is there any workaround that would let me move on with my model building process?

Upvotes: 4

Views: 2387

Answers (2)

Cartoni

Reputation: 38

The problem is probably related to caret's one-hot encoding of your categorical variables. One-hot encoding creates a new column for every level of every categorical variable, so with that many categorical variables it inflates your dataset enormously.
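
For a concrete sense of the blow-up, here is a minimal sketch with toy data (the column counts, level counts and names are made up, not the OP's Kaggle schema). caret's formula interface expands factors into a numeric design matrix, essentially what model.matrix() does, and that is where the columns multiply:

## toy data: two factors with 26 and 500 levels
set.seed(1)
toy <- data.frame(
  outcome = factor(sample(c("yes", "no"), 1000, replace = TRUE)),
  cat_a   = factor(sample(letters, 1000, replace = TRUE)),
  cat_b   = factor(sample(paste0("id_", 1:500), 1000, replace = TRUE),
                   levels = paste0("id_", 1:500))
)

## train(outcome ~ ., ...) builds a design matrix like this behind the scenes
mm <- model.matrix(outcome ~ ., data = toy)
ncol(toy)                             # 3 columns before encoding
ncol(mm)                              # 525 columns: one dummy per level (minus reference) plus intercept
print(object.size(mm), units = "MB")  # memory grows with rows * total number of factor levels

Scale that up to 1.5 million rows and high-cardinality categorical predictors and the design matrix can easily reach the 150 GB reported in the error.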

Maybe you could try something like the h2o package, which handles categorical variables differently, so your dataset does not explode when the model is run.
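
If you want to try that route, a rough sketch might look like the following (the memory size, tree count and fold count are placeholders, and rf_Train is the training frame from the question):

library(h2o)

h2o.init(max_mem_size = "16G")        # start a local H2O cluster

train_h2o <- as.h2o(rf_Train)                      # push the R data frame into H2O
train_h2o$outcome <- as.factor(train_h2o$outcome)  # make it a binary classification target

rf_h2o <- h2o.randomForest(
  x = setdiff(names(train_h2o), "outcome"),
  y = "outcome",
  training_frame = train_h2o,
  ntrees = 100,
  nfolds = 2                          # roughly mirrors the 2-fold CV in the question
)

h2o.performance(rf_h2o, xval = TRUE)

H2O keeps factor columns as factors and splits on them directly, so it avoids materialising the giant dummy-coded matrix that caret's formula interface builds.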

Upvotes: 0

Tim Biegeleisen

Reputation: 521249

The dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size).

To be clear here, you most likely don't need 1.5 million rows to build a model. Instead, you should take a smaller subset that doesn't cause the memory problems. If you are concerned about reducing the size of your sample data, you can run some descriptive stats on the 40 predictors in the smaller set and make sure its behavior appears to be the same as in the full data.
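
One way to act on that advice, as a sketch (the 10% fraction is just an example; tune it to what your RAM tolerates), is to take a stratified subsample with caret's createDataPartition, which the question already loads, before calling train():

## keep ~10% of the rows, preserving the outcome class balance
set.seed(42)
small_idx   <- createDataPartition(first_Try$outcome, p = 0.10, list = FALSE)
first_Small <- first_Try[small_idx, ]

## then split and train exactly as in the question, but on first_Small
in_Tr    <- createDataPartition(first_Small$outcome, p = 0.75, list = FALSE)
rf_Train <- first_Small[in_Tr, ]
rf_Test  <- first_Small[-in_Tr, ]

## quick check that the subsample looks like the full data
prop.table(table(first_Try$outcome))
prop.table(table(first_Small$outcome))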

Upvotes: 3
