SixSigma

Reputation: 3260

Work on data larger than physical RAM in R using bigmemory?

I am developing an R package called biglasso that fits lasso models for massive data sets using the memory-mapping techniques implemented in the bigmemory C++ library. Specifically, for a very large dataset (say 10 GB), a file-backed big.matrix is first created, with memory-mapped files stored on disk. The model-fitting algorithm then accesses the big.matrix via the MatrixAccessor defined in the C++ library to obtain the data for computation. I assume that the memory-mapping technique allows working on data larger than the available RAM, as mentioned in the bigmemory paper.
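
To make the setup concrete, here is a minimal sketch of what I mean (the dimensions and file names are made up for illustration; the actual internals of biglasso differ in detail):

```r
library(bigmemory)

## Create a file-backed big.matrix: the data live in a memory-mapped file
## on disk ("X.bin"), not in R's heap. Sizes and file names are illustrative.
X <- filebacked.big.matrix(nrow = 1e6, ncol = 1000, type = "double",
                           backingfile = "X.bin",
                           descriptorfile = "X.desc",
                           backingpath = ".")

## In a later session the matrix is re-attached from its descriptor file,
## again without reading the data into RAM.
X <- attach.big.matrix(file.path(".", "X.desc"))

## The fitting code on the C++ side receives X@address (an external pointer)
## and wraps it in bigmemory's MatrixAccessor<double> to read entries.
```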

For my package, everything works great at this point if the data size doesn't exceed the available RAM. However, the code runs seemingly forever when the data is larger than RAM: no complaints, no errors, no stopping. On Mac, I checked with the top command and noticed that the status of the job kept switching between "sleeping" and "running". I am not sure what this means or whether it indicates something is going on.

[EDIT:]

By "cannot finish", "run forever", I mean that: working on 18 GB of data with 16 GB RAM cannot finish for over 1.5 hours, but it could be done within 5 minutes if with 32 GB RAM.

[END EDIT]

Questions:

(1) I basically understand that memory-mapping utilizes virtual memory so that it can handle data larger than RAM. But how much memory does it need to deal with larger-than-RAM objects? Is there an upper bound? Or is it decided by the size of the virtual memory? Since virtual memory is effectively unlimited (constrained only by the hard drive), would that mean the memory-mapping approach can handle data much larger than physical RAM?

(2) Is there a way to measure the memory used in physical RAM and the virtual memory used, separately?

(3) Is there anything I am doing wrong? What are possible reasons for my problems here?

Really appreciate any feedback! Thanks in advance.


Below are some details of my experiments on Mac and Windows and related questions.

  1. On Mac OS: Physical RAM: 16 GB; Testing data: 18 GB. Here is a screenshot of the memory usage. The code cannot finish.

[Screenshot: memory usage on Mac]

[EDIT 2]

[Screenshot: CPU usage and CPU history]

I attached the CPU usage and history above. Only a single core is used for the R computation. It is strange that the system uses 6% CPU while the user uses just 3%. And in the CPU history window, there are a lot of red areas.

Question: What does this suggest? I now suspect that the CPU cache is filled up. Is that right? If so, how could I resolve this issue?

[END EDIT 2]

Questions:

(4) As I understand it, the "memory" column shows the memory used in physical RAM, while the "real memory" column shows the total memory usage, as pointed out here. Is that correct? The memory used always shows ~2 GB, so I don't understand why so much of the RAM is left unused.

(5) A minor question: as I observed, it seems that "memory used" + "Cache" must always be less than "Physical memory" (in the bottom middle part). Is this correct?


  2. On Windows machine: Physical RAM: 8 GB; Testing data: 9 GB. What I observed was that as my job started, the memory usage kept increasing until it hit the limit. That job cannot finish either. I also tested functions from the biganalytics package (which also uses bigmemory) and found that the memory blows up there too (see the sketch after the screenshot below).

[Screenshot: memory usage on Windows]
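
To be concrete about the biganalytics test mentioned above, it looked roughly like this (the file name is illustrative, and colmean stands in for the functions I tried):

```r
library(bigmemory)
library(biganalytics)

## Attach the file-backed matrix by its descriptor file (illustrative name);
## this does not load the 9 GB of data into RAM.
X <- attach.big.matrix("X.desc")

## A whole-matrix scan from biganalytics; on the 8 GB machine the memory
## usage climbs to the limit while this runs and the call does not return.
cm <- colmean(X, na.rm = TRUE)
```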

Upvotes: 2

Views: 2501

Answers (2)

Bimal Parakkal

Reputation: 21

On Windows, it might be useful to check Task Manager -> Performance -> Resource Monitor -> Disk Activity to see how much data is being written to disk by your process ID. It can give an idea of how much data is being paged out to virtual memory from RAM, whether write speed is becoming a bottleneck, etc.

Upvotes: 2

antlersoft

Reputation: 14786

"Cannot finish" is ambiguous here. It may be that your computation will complete, if you wait long enough. When you work with virtual memory, you page it on and off disk, which is thousands to millions times slower than keeping it in RAM. The slowdown you will see depends on how your algorithm accesses memory. If your algorithm only visits each page one time in a fixed order, it might not take too long. If your algorithm is hopping all around your data structure O(n^2) times, the paging is going to slow you down so much it might not be practical to complete.

Upvotes: 2
