JordanBelf

Reputation: 3338

Why does loading a model take so much time in R?

For a personal project I need to run several machine learning algorithms against different texts in order to classify them.

I used to do this with RapidMiner, but I decided to move all my development to R, where I feel I have more control.

The issue I am seeing now (which I did not notice with RapidMiner) is that loading the models is taking a lot of time.

For example:

I have a model which checks whether a text refers to sports. The model file is 37.7 MB and it takes 8 minutes 34 seconds to load on my 2.2 GHz i7 Mac with 4 GB of RAM.

The way I am calling the model is the following:

# Build the file names for the document-term matrix and the trained model
fileNameMatrix = paste(query, query1, "-matrix.Rd", sep = "")
fileNameModel = paste(query, query1, "-model.Rd", sep = "")

# Restore both objects into the workspace
load(fileNameMatrix)
load(fileNameModel)

The model was generated using RTextTools

The query variables are there because I need to call almost 20 models and compare them against different datasets. That is why, although 8 minutes is not a lot on its own, reading all of them takes almost 3 hours just on loading, which makes my task almost useless considering it is a near real-time task.
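Roughly, the overall loop looks like this (the query values below are hypothetical placeholders; in practice there are close to 20 model/matrix pairs):

# Sketch of the loading loop with hypothetical query values
queries <- c("sports", "politics", "finance")
query1 <- "-v1"   # hypothetical second name component

for (query in queries) {
  load(paste(query, query1, "-matrix.Rd", sep = ""))
  load(paste(query, query1, "-model.Rd", sep = ""))
  # ... classify the incoming texts with the loaded model here
}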

What factors should I consider to reduce loading time, given that reducing the size of the model is not an option?

One other thing I find suspicious is that while the matrix file is rather small (64 KB), the model is 37.7 MB. Is it possible that the model file is bigger than necessary? Has anyone experienced something similar with RTextTools?

This is one of my first tasks using models in R, so excuse me if I am doing something that is obviously wrong.

Thanks a lot for your time; any tip in the right direction will be much appreciated!

Upvotes: 1

Views: 1022

Answers (2)

jclancy

Reputation: 52318

Have you checked the RAM usage in your Activity Monitor? Compressed RData files are relatively tiny, but they can uncompress to something massive. For instance, an n x n matrix of all 0's takes up essentially no space on disk for any n (that may explain your small matrix file), yet your loaded model might be huge in memory; I have some RData files of maybe 200 MB that cannot even be loaded into memory in R. This becomes a problem if you are running low on RAM, as your computer may start using drive space (swapping) to load the files.
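A quick way to see this effect (a minimal sketch, not part of the original answer):

# A large all-zero matrix compresses to almost nothing on disk,
# but still occupies its full size in RAM once loaded.
m <- matrix(0, nrow = 5000, ncol = 5000)
print(object.size(m), units = "MB")   # roughly 190 MB in memory

tmp <- tempfile(fileext = ".Rd")
save(m, file = tmp)                   # save() compresses by default
file.info(tmp)$size                   # only a few KB on disk

rm(m)
load(tmp)                             # re-inflates to ~190 MB of RAM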

Upvotes: 2

ozjimbob

Reputation: 300

I'm not familiar with the model output from RTextTools, but it's pretty common for a model object to be significantly larger than the input data frame. For example, the output of a glm contains all the input data, as well as predicted values, residuals, coefficients, errors, you name it. The output of a randomForest model contains the input data as well as the definitions of thousands of trees, and so on.

How does the loading time of the models compare to running them from scratch? Have you looked inside the model object to see what it contains, with a view to pruning off any statistics you don't need? For example:

loadedNames <- load(fileNameModel)   # load() returns the name(s) of the object(s) it restores
str(get(loadedNames[1]))             # inspect the structure of the model object itself
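As a rough sketch of what pruning can look like, using a plain glm as a stand-in (the internals of an RTextTools model object may store different components, so the slot names below are only illustrative):

# Drop heavy components that are not needed for prediction, then compare sizes
fit <- glm(mpg ~ wt + hp, data = mtcars)
print(object.size(fit), units = "KB")

slim <- fit
slim$data <- NULL            # copy of the input data frame
slim$model <- NULL           # model frame
slim$residuals <- NULL
slim$fitted.values <- NULL
print(object.size(slim), units = "KB")

# Prediction on new data still works from the coefficients and terms
predict(slim, newdata = mtcars[1:3, ])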

Upvotes: 2
