user1588659

Reputation: 13

Apply function in R causes memory allocation error

I am trying to do a calculation on each row of a data frame in R and append the calculation as a new column in the frame. I started using the "by" function but it was incredibly slow doing the calculation so I switched to the "apply" function instead. The way I imagine it will work is by running apply with my function, saving the output to a variable and appending that data to the original data frame.

I created a function to calculate the term length of an insurance plan and return that value, and it works fine on a sample data set. When I use my larger data set, I get an error of "cannot allocate vector of size ...". I know many people recommend getting more RAM, but I already have 16GB of memory, and with the entire data set loaded R reports using only 7.7GB. The data set has 44 columns and ~11 million records, so I don't see how adding one more column of data takes up 8GB of memory.

Any pointer in the right direction would be great.


Below is the function I am using:

get_term_length <- function(row_data){

    # convert values to dates
    expiration_date   <- as.Date( row_data[42] )
    start_date        <- as.Date( row_data[43] )
    cancellation_date <- as.Date( row_data[44] )

    # if the cancellation date is NA, just use the entire policy length
    if( is.na(cancellation_date) ){
        return( expiration_date - start_date )
    }

    # check to see if the policy was cancelled early
    if( cancellation_date < expiration_date ){
        return( cancellation_date - start_date )
    } else {
        # the policy ran for the entire term
        return( expiration_date - start_date )
    }

}

I have been running the function by calling:

tmp <- apply(policy_data, 1, get_term_length)
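(For context on the memory blow-up: `apply` first runs `as.matrix()` on its input, and with 44 mixed-type columns that produces one giant character matrix alongside the original data frame. A minimal sketch, with made-up column names, showing the coercion:)

```r
# apply() coerces the data frame to a matrix before iterating;
# with mixed column types, everything collapses to character,
# duplicating the whole data set in memory
df <- data.frame(id         = 1:3,
                 start_date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                 premium    = c(100.5, 200.25, 150))

# each row reaches the function as a character vector
row_classes <- apply(df, 1, class)
unique(row_classes)  # "character"
```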

Upvotes: 1

Views: 2030

Answers (3)

mnel
mnel

Reputation: 115425

A data.table solution as hinted at by @Dwin

library(data.table)
policy_data <- as.data.table(policy_data)

# set the date columns to be IDate (the exact form of this will depend
# on the format they are currently in)
policy_data[, cancellation_date := as.IDate(cancellation_date)]
policy_data[, start_date := as.IDate(start_date)]
policy_data[, expiration_date := as.IDate(expiration_date)]

# create a column which is an indicator for NA
policy_data[, isna := is.na(cancellation_date)]

setkey(policy_data, isna)

policy_data[J(TRUE), tmp := expiration_date - start_date]
policy_data[J(FALSE), tmp := pmin(cancellation_date - start_date,
                                  expiration_date - start_date)]
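
A toy run of the approach above, with made-up dates (column names assumed to match the question's):

```r
library(data.table)

policy_data <- data.table(
  start_date        = as.IDate(c("2020-01-01", "2020-01-01", "2020-01-01")),
  expiration_date   = as.IDate(c("2021-01-01", "2021-01-01", "2021-01-01")),
  cancellation_date = as.IDate(c(NA, "2020-07-01", "2021-06-01"))
)

policy_data[, isna := is.na(cancellation_date)]
setkey(policy_data, isna)  # sorts FALSE rows before TRUE rows

policy_data[J(TRUE),  tmp := expiration_date - start_date]
policy_data[J(FALSE), tmp := pmin(cancellation_date - start_date,
                                  expiration_date - start_date)]
policy_data[]
# tmp in days: 182 (cancelled early), 366 (cancellation after expiry, capped),
#              366 (no cancellation, full term)
```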

Upvotes: 5

IRTFM

Reputation: 263411

The "rule of thumb" is that you will need at least 3 times as much contiguous RAM as the largest object you are working with, so you may need to do more frequent restarts and limit the number of other running applications on your system. I am reasonably happy with my hardware that has 32 GB and am working on regression models based on data objects that are 5-6 GB in size.

(I, too, am generally unhappy with the time it takes to add a column of data, so I sometimes create 'outside' or 'parallel' vectors, often using the ave() function, for table operations. Or I work with subset()-s of the data.) The R gurus are working on this without an announced solution as yet. Matthew Dowle has developed the 'data.table' package that solves it (and is remarkably faster) at the expense of needing different semantics for the "[" operation.

You can find other posts here that discuss memory requirements. The most frequently voted one is probably: Tricks to manage the available memory in an R session . I'm a bit surprised that the data.table package is not mentioned anywhere in the answers there.

Edit: I saw the answer you accepted (which is really an instance of working with a subset of the data), but I think it is not the same result as you were getting before. To get that result, you would need something like:

tmp <- with(policy_data,
            ifelse(is.na(cancellation_date),
                   expiration_date - start_date,  # else
                   pmin(as.Date(cancellation_date) - as.Date(start_date),
                        as.Date(expiration_date) - as.Date(start_date))))

You do not want to use "na.rm=TRUE" in pmin if you want the ifelse function to work properly.
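
(A small demonstration with made-up dates, also showing a known `ifelse` quirk: the result comes back as plain numeric days, since `ifelse` strips the difftime class:)

```r
start_date        <- as.Date(c("2020-01-01", "2020-01-01"))
expiration_date   <- as.Date(c("2021-01-01", "2021-01-01"))
cancellation_date <- as.Date(c(NA, "2020-07-01"))

tmp <- ifelse(is.na(cancellation_date),
              expiration_date - start_date,
              pmin(cancellation_date - start_date,
                   expiration_date - start_date))
tmp  # 366 182 -- plain numeric; ifelse drops the difftime class
```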

Upvotes: 4

ppham27

Reputation: 823

Doesn't look like you need the whole data.frame. Just extract the 3 columns you need, and you have most of your memory back.

tmp <- with(policy_data, 
            pmin(as.Date(cancellation_date)-as.Date(start_date),
                 as.Date(expiration_date)-as.Date(start_date),
                 na.rm=TRUE))
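
(With made-up dates, this shows how `na.rm=TRUE` makes `pmin` fall back to the full-term length when the cancellation date is NA; note this is not identical to the question's logic for the NA case, as the answer above points out:)

```r
start_date        <- as.Date(c("2020-01-01", "2020-01-01"))
expiration_date   <- as.Date(c("2021-01-01", "2021-01-01"))
cancellation_date <- as.Date(c(NA, "2020-07-01"))

# na.rm=TRUE ignores the NA cancellation term, leaving the full term
tmp <- pmin(cancellation_date - start_date,
            expiration_date - start_date,
            na.rm = TRUE)
tmp  # time differences of 366 and 182 days
```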

Upvotes: 3
