Reputation: 1
I have been using SAS for the past few years and have recently started using R. I have a dataset (not that huge by SAS standards) which, whenever I try to load it in R, occupies almost 10 times its on-disk size in memory. I use the following code:
assign('rawdf',readRDS(paste0("/srv/data/.../data/als_raw.rds")))
rawdf<- data.table(rawdf)
This load takes approx 10 mins.
The first screenshot (attached) shows the size of the RDS file on disk; the second shows the memory occupied when I load it as a data frame.
So you can see that the 2.7 GB dataset ends up using 27.8 GB of RAM! Before you suggest that I convert it to a data.table and then use it, I have tried that as well. I used
setDT(rawdf) # converts by reference
and
rawdf <- data.table(rawdf) # another way
but in both cases the session hangs due to low memory. Any help would be highly appreciated; I'm not sure where I am going wrong.
Upvotes: 0
Views: 1110
Reputation: 94267
RDS files use compression. Example:
> d = data.frame(a = sample(10,1000000,replace=TRUE))
> saveRDS(d,"d.rds")
> object.size(d)
4000672 bytes
> file.size("d.rds")
[1] 639072
> object.size(d)/file.size("d.rds")
6.26012718441741 bytes
Here I've created a data frame that uses 4000672 bytes in memory and saves to an RDS with 639072 bytes, a factor of 6.26 smaller.
object.size is a much better way of finding out how big an object is in memory than looking at how much RAM R has used with your system tools.
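For instance, a minimal sketch using the object from the question (assuming rawdf is already loaded, and reusing the placeholder path from the question):

# In-memory size of the object itself, independent of how much RAM the R process shows
format(object.size(rawdf), units = "GB")
# Size of the compressed file on disk, for comparison
file.size("/srv/data/.../data/als_raw.rds")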
Upvotes: 3
Reputation: 23211
It's very likely just the compressed vs. decompressed size.***
If you save a SAS dataset in compressed form then it's no different there, save for the fact that
(a) R's compression is significantly more efficient than sas7bdat compression (especially for RDS files), and
(b) SAS does not hold the data entirely in memory (RAM), which is what R does by default for speed.
However, R also has the ability to work from disk storage like SAS instead of RAM if you choose. There are multiple ways to do this, the most common being the ff package. Google and Stack Overflow have many examples of ff if you need that. There are also many other big-data strategies you can employ, like sampling, chunking, etc.
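For example, here is a minimal sketch with ff, assuming the data is also available as a flat file (als_raw.csv is a hypothetical name; ff reads from delimited text files rather than RDS):

library(ff)
# Reads the file into a disk-backed ffdf; only chunks are pulled into RAM as needed
rawff <- read.csv.ffdf(file = "als_raw.csv", header = TRUE)
dim(rawff)  # behaves much like a data frame, but the data stays on disk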
Incidentally, using assign like that is not good practice, though it's unrelated to your main question. Better to use a plain rawdf <- assignment; it avoids environment and scoping issues. Also, the paste0 was unnecessary since it wraps a single string.
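A minimal sketch of that, reusing the placeholder path from your question:

library(data.table)
# Plain assignment instead of assign(), and no paste0() around a single string
rawdf <- readRDS("/srv/data/.../data/als_raw.rds")
setDT(rawdf)  # converts to data.table by reference, without copying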
*** unless that screenshot indicates your session has been running for over 12 hours, in which case, for the love of R, shut it down and open a new session.
Upvotes: 2