D Greenwood

Reputation: 446

Why are matrices (in R) so much slower and larger than image files that contain the same data?

I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have one channel for each ion we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:

x, y, z, i (intensity), m (mass)

As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1,310,720 pixels. If each pixel has 300 mass channels, this gives a table with 393,216,000 rows and 5 columns. This is huge, and consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example taking just a second or two to open a file into memory.
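To put numbers on that, here is a rough sketch of the arithmetic, assuming every value is held as R's default 8-byte double:

pixels <- 256 * 256 * 20    # 1,310,720 pixels
rows   <- pixels * 300      # 393,216,000 rows
rows * 5 * 8 / 1e9          # ~15.7 GB just to hold the table in memory as doubles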

I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?

Upvotes: 0

Views: 485

Answers (2)

D Greenwood

Reputation: 446

The answer to this question turned out to be a little esoteric and pretty specific to my dataset, but may be of interest to others. My data is very sparse - i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to handle sparse matrices more efficiently. To implement it, I just loaded the package and converted my matrix:

library(Matrix)
data <- Matrix(data)

The amount of space saved will vary depending on the sparsity of the dataset, but in my case I reduced 1.8 GB to 156 MB. A Matrix object behaves just like a base matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format could take advantage of.
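If you want to see the effect before converting your own data, here is a minimal sketch (the dimensions and fill level are made up, chosen only to mimic a mostly-zero channel):

library(Matrix)

dense <- matrix(0, nrow = 1000, ncol = 1000)          # 1e6 doubles
dense[sample(length(dense), 5000)] <- runif(5000)     # ~0.5% non-zero values

sparse <- Matrix(dense, sparse = TRUE)                # stored as a sparse dgCMatrix

object.size(dense)    # ~8 MB: every cell stored, zeros included
object.size(sparse)   # a small fraction of that: only non-zero entries stored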

Upvotes: 0

Severin Pappadeux

Reputation: 20110

Can anyone explain this?

Yep

How can it be that two files containing essentially the exact same data can have such different sizes and speeds?

R uses doubles as its default numeric type, so the storage for your data frame alone is about 16 GB (393,216,000 rows × 5 columns × 8 bytes ≈ 15.7 GB). The proprietary software most likely uses float (4 bytes) as its underlying type, cutting the memory requirement to about 8 GB.
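A quick way to check the 8-bytes-per-value cost in an R session (a minimal illustration, not tied to your data):

object.size(numeric(1e6))    # ~8 MB: one million doubles at 8 bytes each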

How can I work with a matrix of image data much faster?

Buy a computer with 32 GB of RAM. Even with a 32 GB machine, think about using data.table in R with operations done by reference, because R likes to copy data frames.
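For example, a sketch of the by-reference pattern (the file name, column names and mass value below are hypothetical placeholders for your exported table):

library(data.table)

# fread is fast and returns a data.table
dt <- fread("image_export.txt", col.names = c("x", "y", "z", "i", "m"))

# := updates the column in place (by reference), so no copy of the whole table is made
dt[, i := i / max(i)]

# keying by mass makes repeated single-channel subsets fast
setkey(dt, m)
channel <- dt[.(301.2)]    # all rows for one (hypothetical) mass value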

Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.

UPDATE

If you want to stay with R, take a look at the bigmemory package, though I would say dealing with it is not for people with a weak heart.
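Roughly, the idea is a file-backed matrix that is memory-mapped from disk rather than held in RAM; a minimal sketch, assuming the filebacked.big.matrix() interface (file names here are hypothetical):

library(bigmemory)

# the data live in the backing file on disk, so they do not need to fit in RAM
bm <- filebacked.big.matrix(nrow = 393216000, ncol = 5, type = "double",
                            backingfile = "msi_data.bin",
                            descriptorfile = "msi_data.desc")

bm[1:10, ]    # indexed like an ordinary matrix; only the requested rows are read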

Upvotes: 1
