D Greenwood

Reputation: 446

Why are matrices (in R) so much slower and larger than image files that contain the same data?

I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have one channel for each ion we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:

x, y, z, i (intensity), m (mass)

As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1,310,720 pixels. If each pixel has 300 mass channels, this gives a table with 393,216,000 rows and 5 columns. This is huge, and consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example taking just a second or two to open a file into memory.
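To put numbers on that, here is a rough sketch of the arithmetic, assuming every value is held as R's default 8-byte double:

pixels <- 256 * 256 * 20    # 1,310,720 pixels
rows   <- pixels * 300      # 393,216,000 rows
rows * 5 * 8 / 1e9          # ~15.7 GB just to hold the table in memory as doubles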

I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?

Upvotes: 0

Views: 485

Answers (2)

D Greenwood

Reputation: 446

The answer to this question turned out to be a little esoteric and pretty specific to my dataset, but may be of interest to others. My data is very sparse - i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to handle sparse matrices more efficiently. To implement it, I just loaded the package and converted my matrix:

library(Matrix)
data <- Matrix(data)

The amount of space saved will vary depending on the sparsity of the dataset, but in my case I reduced 1.8 GB to 156 MB. A Matrix object behaves just like a base matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format could take advantage of.
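If you want to see the effect before converting your own data, here is a minimal sketch (the dimensions and fill level are made up, chosen only to mimic a mostly-zero channel):

library(Matrix)

dense <- matrix(0, nrow = 1000, ncol = 1000)          # 1e6 doubles
dense[sample(length(dense), 5000)] <- runif(5000)     # ~0.5% non-zero values

sparse <- Matrix(dense, sparse = TRUE)                # stored as a sparse dgCMatrix

object.size(dense)    # ~8 MB: every cell stored, zeros included
object.size(sparse)   # a small fraction of that: only non-zero entries stored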

Upvotes: 0

Severin Pappadeux

Reputation: 20110

Can anyone explain this?

Yep

How can it be that two files containing essentially the exact same data can have such different sizes and speeds?

R uses doubles as its default numeric type, so the storage for your data frame alone is about 16 GB (393,216,000 rows × 5 columns × 8 bytes ≈ 15.7 GB). The proprietary software most likely uses float (4 bytes) as its underlying type, cutting the memory requirement to about 8 GB.
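A quick way to check the 8-bytes-per-value cost in an R session (a minimal illustration, not tied to your data):

object.size(numeric(1e6))    # ~8 MB: one million doubles at 8 bytes each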

How can I work with a matrix of image data much faster?

Buy a computer with 32 GB of RAM. Even with a 32 GB machine, think about using data.table in R with operations done by reference, because R likes to copy data frames.
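For example, a sketch of the by-reference pattern (the file name, column names and mass value below are hypothetical placeholders for your exported table):

library(data.table)

# fread is fast and returns a data.table
dt <- fread("image_export.txt", col.names = c("x", "y", "z", "i", "m"))

# := updates the column in place (by reference), so no copy of the whole table is made
dt[, i := i / max(i)]

# keying by mass makes repeated single-channel subsets fast
setkey(dt, m)
channel <- dt[.(301.2)]    # all rows for one (hypothetical) mass value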

Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.

UPDATE

If you want to stay with R, take a look at the bigmemory package, though I would say dealing with it is not for people with a weak heart.
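Roughly, the idea is a file-backed matrix that is memory-mapped from disk rather than held in RAM; a minimal sketch, assuming the filebacked.big.matrix() interface (file names here are hypothetical):

library(bigmemory)

# the data live in the backing file on disk, so they do not need to fit in RAM
bm <- filebacked.big.matrix(nrow = 393216000, ncol = 5, type = "double",
                            backingfile = "msi_data.bin",
                            descriptorfile = "msi_data.desc")

bm[1:10, ]    # indexed like an ordinary matrix; only the requested rows are read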

Upvotes: 1
