KSR

Reputation: 1

When I load a 2.7 GB RDS file in R Server, it occupies 27 GB of memory. Why is this happening?

I have been using SAS for the past few years and have recently started using R. I have a dataset (not that huge by SAS standards) which, whenever I try to load it in R, occupies almost 10 times its size on disk. I use the following code:

library(data.table)
assign('rawdf', readRDS(paste0("/srv/data/.../data/als_raw.rds")))
rawdf <- data.table(rawdf)

This load takes approximately 10 minutes.

Screenshots attached for reference:

[Screenshot: the size of the RDS file]

[Screenshot: the memory occupied when it is loaded as a data frame]

So you can see that the 2.7 GB dataset ends up using 27.8 GB of RAM! Before you suggest that I convert it to a data.table and then use it, I have tried that as well. I used

setDT(rawdf)  # as it converts by reference

and

rawdf <- data.table(rawdf)   #another way

But in both cases the session hangs due to low memory! Any help would be highly appreciated. I am not sure where I am going wrong here.

Upvotes: 0

Views: 1110

Answers (2)

Spacedman

Reputation: 94267

RDS files use compression. Example:

> d = data.frame(a = sample(10,1000000,replace=TRUE))
> saveRDS(d,"d.rds")
> object.size(d)
4000672 bytes
> file.size("d.rds")
[1] 639072
> object.size(d)/file.size("d.rds")
6.26012718441741 bytes

Here I've created a data frame that uses 4000672 bytes in memory and saves to an RDS file of 639072 bytes, a factor of 6.26 smaller.

object.size() is a much better way of finding out how big an object is in memory than looking at how much RAM the R process has used with your system tools.
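
For instance, a small sketch of reporting the size in readable units (using the toy data frame d from above; the commented line shows the analogous call for your rawdf object):

> print(object.size(d), units = "Mb")
3.8 Mb
> # print(object.size(rawdf), units = "Gb")   # same idea for the real object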

Upvotes: 3

Hack-R

Reputation: 23211

It's very likely just the compressed vs. decompressed size.***
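
As a toy illustration (not your data; the file names are just placeholders), writing the same object with and without compression shows the gap:

d <- data.frame(a = sample(10, 1e6, replace = TRUE))
saveRDS(d, "d_gz.rds")                          # gzip-compressed (the default)
saveRDS(d, "d_raw.rds", compress = FALSE)       # uncompressed serialization
file.size("d_raw.rds") / file.size("d_gz.rds")  # > 1: the compressed file is smaller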

If you save a SAS dataset in compressed form then it's no different there, except that:

(a) R's compression is significantly more efficient than sas7bdat's (especially in the case of RDS files), and

(b) SAS does not work with data entirely in memory (RAM), which is the default in R for speed.

However, R also has the ability to work from disk storage like SAS instead of RAM if you choose. There are multiple ways to do this, the most common being the ff package. Google and Stack Overflow have many examples of ff if you need that. There are also many other big-data strategies you can employ, such as sampling, chunking, etc.
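
A rough sketch with ff, assuming the raw data is also available as a delimited text file (say, als_raw.csv; an .rds file itself cannot be read in chunks):

library(ff)
rawdf_ff <- read.csv.ffdf(file = "als_raw.csv",
                          first.rows = 100000,  # rows used to guess column types
                          next.rows  = 500000)  # chunk size for subsequent reads
dim(rawdf_ff)  # dimensions available without holding all rows in RAM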

Incidentally, using assign like that is not good practice, though it's unrelated to your main question. Better to use rawdf <-; it avoids environment and scoping issues. Also, the paste0 was unnecessary since there is only a single string to paste.
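
Something like this (a sketch only; the elided path placeholder is kept from your question):

library(data.table)
rawdf <- readRDS("/srv/data/.../data/als_raw.rds")  # plain assignment instead of assign()
setDT(rawdf)  # convert to data.table by reference, avoiding an extra copy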

*** unless that screen shot indicates your session has been running over 12 hours, in which case for the love of R shut it down and reopen a new session.

Upvotes: 2
