Miguel Vazq

Reputation: 1489

Loading ffdf data takes a lot of memory

I am facing a strange problem: I save ffdf data using

save.ffdf()

from the ffbase package, and when I load it back in a new R session, doing

load.ffdf("data.f") 

it gets loaded into RAM, using approximately 90% of the memory that the same data takes as a data.frame object in R. Given this, it does not make much sense to use ffdf, does it? I can't use ffsave because I am working on a server and do not have the zip application on it.

packageVersion("ff")     # 2.2.10
packageVersion("ffbase") # 0.6.3

Any ideas about this?

[edit] Some example code to help clarify:

data <- read.csv.ffdf(file = fn, header = T, colClasses = classes) 
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols  and some integer cols. 
data.1 <- data 
save.ffdf(data.1, dir = my.dir) # my.dir is a string pointing to the directory, e.g. "C:/data/R/test.f" 

closing the R session... opening again:

load.ffdf(file.name) # file.name is a string pointing to the same directory. 
# That gives me the object data, with class(data) = "ffdf". 

Then I have an ffdf object with 5 columns (shown as ffdf[5]), and its memory footprint is almost as big as that of:

data.R <- data[,] # which is a data.frame. 
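
As a rough cross-check (my own sketch, not part of the original workflow): object.size() reports only the R-side size of each object, including attributes such as factor levels, so it need not match what the OS reports, but it gives an idea of the relative footprint:

print(object.size(data), units = "MB")    # the loaded ffdf object
print(object.size(data.R), units = "MB")  # the full in-memory data.frame copy
gc()                                      # summary of R's current memory usage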

[end of edit]

[SECOND EDIT :: FULL REPRODUCIBLE CODE]

As my question has not been answered yet and I still see the problem, here is a reproducible example:

dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)

memory.limit(size=4000)
N = 1e7; 
df <- data.frame( 
 x = c(1:N), 
 y = sample(letters, N, replace =T), 
 z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
 w = factor( sample(c(1:N/10) , N, replace=T))   )  # c(1:N/10) is (1:N)/10, so w ends up with millions of distinct levels

df[1:10,]
dff <- as.ffdf(df)
head(dff)
#str(dff)

save.ffdf(dff, dir = "dframeffdf")
dim(dff)
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)

### resetting R :: fresh RStudio Session 
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.size() # 15.63 
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287

So we have 384 MB in memory, and after gc() there are 287 MB, which is around the size of the data on disk (also checked with the Process Explorer application for Windows).

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ffbase_0.7-1 ff_2.2-10    bit_1.1-9

[END SECOND EDIT]

Upvotes: 1

Views: 1792

Answers (2)

user1600826

Reputation:

In ff, when you have factor columns, the factor levels are always kept in RAM. ff character columns currently don't exist; character columns are converted to factors in an ffdf.

Regarding your example: the 'w' column in 'dff' contains more than 6 million levels, and these levels are all in RAM. If you did not have columns with that many levels, you would not see the RAM increase, as shown below using a modified version of your example.

N = 1e7; 
df <- data.frame( 
 x = c(1:N), 
 y = sample(letters, N, replace =T), 
 z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
 w = sample(c(1:N/10) , N, replace=T))   
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")

### resetting R :: fresh RStudio Session 
library(ff)
library(ffbase)
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
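
As a follow-up sketch (my own illustration, reusing the construction of the original 'w' column from the question): the size of the level vector alone shows where the memory goes. N is re-set here so the snippet is self-contained:

N <- 1e7
w_fact <- factor(sample(c(1:N/10), N, replace = TRUE))  # same construction as the question's 'w'
nlevels(w_fact)                                          # more than 6 million distinct levels
print(object.size(levels(w_fact)), units = "MB")         # the character levels alone account for a large share of the RAM observed after load.ffdf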

Upvotes: 2

IRTFM

Reputation: 263471

The ff/ffbase packages have mechanisms for segregating objects into 'physical' and 'virtual' storage. I suspect you are implicitly constructing items in physical memory, but since you offer no code for how this workspace was created, there's only so much guessing that is possible.
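
For what it is worth, a sketch of how that split can be inspected on the ffdf from the question (dff comes from the question's code; physical(), virtual() and filename() are the ff accessors):

library(ff)
library(ffbase)
physical(dff)                    # the per-column ff objects, backed by files on disk
virtual(dff)                     # the virtual part kept in R: dim, dimnames, column order
sapply(physical(dff), filename)  # where each column's data actually lives on disk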

Upvotes: 0
