Reputation: 13
I am trying to do a calculation on each row of a data frame in R and append the calculation as a new column in the frame. I started using the "by" function but it was incredibly slow doing the calculation so I switched to the "apply" function instead. The way I imagine it will work is by running apply with my function, saving the output to a variable and appending that data to the original data frame.
I created a function to calculate the term length of an insurance plan and return that value, which works fine on a sample data set. When I use the my larger data set, I get an error of "cannot allocate vector of size ... ". I know many people recommend getting more RAM but I already have 16GB of memory and with the entire data set loaded in R my computer says it is using only 7.7GB of memory. The data set has 44 columns with ~11 million records so I'm not seeing how adding one more column of data takes up 8GB of memory?
Any point in the right direction would be great.
Below is the function I am using:
get_term_length <- function(row_data){
# convert values to dates
expiration_date <- as.Date( row_data[42] )
start_date <- as.Date( row_data[43] )
cancellation_date <- as.Date( row_data[44] )
# check to see if the cancellation date is NA - just use entire policy length
if( is.na(cancellation_date) ){
return( expiration_date - start_date) )
}
# check to see if policy was cancelled early
if(cancellation_date < expiration_date){
return( cancellation_date - start_date )
}
# the policy was for the entire term
else{
return( expiration_date - start_date )
}
}
I have been running the function by calling:
tmp <- apply(policy_data, 1, get_term_length)
Upvotes: 1
Views: 2030
Reputation: 115425
A data.table
solution as hinted at by @Dwin
library(data.table)
policy_data <- as.data.table(policy_data)
# set the date columns to be IDate (the exact form of this will depend
# on the format they are currently in
policy_data[, cancellation_date := as.IDate(cancellation_date)]
policy_data[, start_date := as.IDate(start_date)]
policy_data[, end_date := as.IDate(end_date)]
# create a column which is an indicator for NA
policy_data[, isna := is.na(cancellation_date)]
setkey(policy_data, isna)
policy_data[J(TRUE), tmp := expiration_date - start_date]
policy_data[J(FALSE), tmp := pmin(cancellation_date - start_date, expiration_date-start_date)]
Upvotes: 5
Reputation: 263411
The "rule of thumb" is that you will need _at_least_ 3 times as much contiguous RAM as the largest object you are working with, so you may need to do more frequent restarts and limit the number of other running applications on your system. I am reasonably happy with my hardware the has 32 GB and am working on regression models based on data-objects that are 5-6 GB in size.
(I, too, am generally unhappy with the time it takes to add a column of data, so I sometimes create 'outside' or 'parallel' vectors, often using the ave() function, for table operations. Or I work with subset()-s of the data.) The R gurus are working on this without an announced solution as yet. Matthew Dowle has developed the 'data.table' package that solves it (and is remarkably faster) at the expense of needing different semantics for the "[" operation.
You can find other posts here that discuss memory requirements. The most frequently voted one is probably: Tricks to manage the available memory in an R session . I'm a bit surprised that the data.table package is not mentioned anywhere in the answers there.
Edit: I saw the answer you accepted (which is really an instance of working with a subset of the data), but I think it is not the same result as you were getting before. To get that result, you would need something like:
tmp <- with(policy_data,
ifelse( is.na(cancellation_date), expiration_date - start_date , # else
pmin(as.Date(cancellation_date)-as.Date(start_date),
as.Date(expiration_date)-as.Date(start_date)
))
You do not want to use "na.rm=TRUE" in pmin
if you want the ifelse
function to work properly.
Upvotes: 4
Reputation: 823
Doesn't look like you need the whole data.frame. Just extract the 3 columns you need, and you have most of your memory back.
tmp <- with(policy_data,
pmin(as.Date(cancellation_date)-as.Date(start_date),
as.Date(expiration_date)-as.Date(start_date),
na.rm=TRUE))
Upvotes: 3