Reputation: 4425
I have a dataset I wish to analyze. It consists of m measurements, where m is roughly 2 000 000, and v variables (about 10 in this case). I can name each variable (foo, bar, etc.) and choose a datatype for each of them, such as uint32.
I want to calculate statistics (such as the mean value of foo) and draw plots (such as a scatter plot of foo vs bar).
Which representation should I choose for my data (or does it even matter)?
m rows and v columns, or v rows and m columns?
I would generally expect better performance if each variable is assigned a contiguous block of memory, but I will be happy to be proven wrong if someone provides a clear answer.
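For concreteness, here is a rough sketch of the two layouts I have in mind (the names, sizes, and zero-filled data are just placeholders):

import numpy as np

m, v = 2_000_000, 10

# Layout 1: one row per measurement, shape (m, v).
# With NumPy's default C order, each row (one measurement) is contiguous.
by_measurement = np.zeros((m, v), dtype=np.uint32)
print(by_measurement[0].flags['C_CONTIGUOUS'])   # True: a single measurement is contiguous

# Layout 2: one row per variable, shape (v, m).
# Here each row (one variable, e.g. foo) is the contiguous block instead.
by_variable = np.zeros((v, m), dtype=np.uint32)
print(by_variable[0].flags['C_CONTIGUOUS'])      # True: all of "foo" is contiguous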
Bonus question: Best way to iteratively build the array?
Upvotes: 1
Views: 96
Reputation: 5228
First things first: Remember that premature optimization is the root of all evil. You can always use the timeit module if you suspect something is slow.
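For instance, a quick timeit sketch comparing the two layouts for a single operation (the sizes and the operation here are made-up examples):

import timeit

setup = """
import numpy as np
a = np.zeros((200_000, 10), dtype=np.uint32)   # measurements as rows
b = np.zeros((10, 200_000), dtype=np.uint32)   # variables as rows
"""
# Time taking the mean of one variable in each layout.
print(timeit.timeit("a[:, 0].mean()", setup=setup, number=1000))
print(timeit.timeit("b[0].mean()", setup=setup, number=1000))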
As for your question, I store my data such that the measurements are indexed by rows and the dimensions are indexed by columns. This way the measurements themselves will (probably*) be contiguous in memory. But the real reason is that if I have measurements with M.shape == (m, v), then M[n] accesses the nth measurement, and that lends itself well to tidy code.
*A NumPy array may not be contiguous in memory if it is built oddly; np.ascontiguousarray will fix this.
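A minimal sketch of that layout (the values are made up, and treating the first column as foo is just an assumption):

import numpy as np

m, v = 1000, 10  # small sizes for the sketch
M = np.arange(m * v, dtype=np.uint32).reshape(m, v)  # rows = measurements, columns = variables

foo = M[:, 0]            # the column holding "foo"
print(foo.mean())        # mean value of foo
print(M[5])              # the 6th measurement: one value per variable

# If an array ends up non-contiguous (e.g. after a transpose or slicing),
# np.ascontiguousarray returns a C-contiguous copy.
M = np.ascontiguousarray(M)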
BONUS:
If you are iteratively building an array, it is best not to use a NumPy array to begin with; they are not meant to be resized easily. I would store all your data in a Python list and use its append method to add new measurements. Then, when you need to put the data into NumPy, call np.array on the list; it will copy the data so that efficient operations can be performed on it. But for dynamic storage, go with Python lists. They are very efficient.
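For example, a sketch of the list-then-convert approach (the loop and the fake values are just placeholders for however you actually acquire measurements):

import numpy as np

measurements = []                     # plain Python list, cheap to append to
for i in range(1000):                 # stand-in for your acquisition loop
    measurements.append((i, 2 * i))   # one (foo, bar) tuple per measurement

M = np.array(measurements, dtype=np.uint32)  # copies into a (1000, 2) array
print(M.shape)
print(M.mean(axis=0))                        # per-variable means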
Upvotes: 3