Calculating some statistics for each column of a numpy ndarray

Question

I have a 4D numpy array of input data where each column represents a quantity (say speed, acceleration, etc.), and I would like to calculate some statistical information for each quantity (mean, standard deviation, median, and the 75th, 85th and 95th percentiles).

So for example:

import numpy as np

input_shape = (1,200,4)
n_sample = 100

X = np.random.uniform(0,1, (n_sample,) + input_shape)
X.shape
(100, 1, 200, 4)

X[0]
array([[[0.50410922, 0.82829892, 0.72460878, 0.0562701 ],
        [0.49223423, 0.14152948, 0.32285973, 0.49056405],
        ...
        [0.8299407 , 0.78446729, 0.40959698, 0.893117  ],
        [0.25150705, 0.56759064, 0.28280459, 0.0599566 ]]])

Each column of X represents some physical quantity for 200 data points. The statistics of each quantity are what I'm interested in.

EDIT

I would expect something like:

[[[col1_mean, col2_mean, col3_mean, col4_mean ],
   [col1_std, col2_std, col3_std, col4_std],
   [col1_med, col2_med, col3_med, col4_med],
   [col1_p75, col2_p75, col3_p75, col4_p75 ],
   [col1_p85, col2_p85, col3_p85, col4_p85 ],
   [col1_p95, col2_p95, col3_p95, col4_p95 ]]]

So the result is shaped (100, 1, 6, 4)

bnaecker · Accepted Answer

The easiest thing would be to compute the statistics of interest by supplying an axis argument. This is used by many NumPy functions to run their computation along that axis. For your data, it seems you'd like to compute across the "data points" dimension, which is axis=2. For example:

>>> import numpy as np
>>> input_shape = (1,200,4)
>>> n_sample = 100
>>> X = np.random.uniform(0,1, (n_sample,) + input_shape)
>>> X.shape
(100, 1, 200, 4)
>>> X.mean(axis=2).shape  # Compute mean along 3rd axis
(100, 1, 4)
>>> stat_functions = (np.mean, np.std, np.median)
>>> stats = [func(X, axis=2) for func in stat_functions]
>>> list(map(np.shape, stats))
[(100, 1, 4), (100, 1, 4), (100, 1, 4)]
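As a quick sanity check of what axis=2 means for this layout, the sketch below compares one entry of the reduced array against the mean of the corresponding column computed by hand. The seed, the chosen indices, and the variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.uniform(0, 1, (100, 1, 200, 4))

# Reducing along axis=2 collapses the 200 data points: (100, 1, 200, 4) -> (100, 1, 4)
col_means = X.mean(axis=2)

# The same value for sample 7, quantity 3, taken directly from its 200 data points
by_hand = X[7, 0, :, 3].mean()

print(col_means.shape)
print(np.isclose(col_means[7, 0, 3], by_hand))
```

If the two values agree, axis=2 is indeed the "data points" dimension and each remaining entry is a per-column statistic.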

You'll have to do a bit more work to create functions to compute the percentiles you're interested in:

>>> import functools
>>> percentiles = tuple(functools.partial(np.percentile, q=q) for q in (75, 85, 95))
>>> stat_functions = (np.mean, np.std, np.median) + percentiles

If you want to join these into a single array, you can use the keepdims kwarg of each to avoid removing the axis along which the function is applied, and then concatenate the results:

>>> stats = np.concatenate([func(X, axis=2, keepdims=True) for func in stat_functions], axis=2)
>>> stats.shape
(100, 1, 6, 4)
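As a possible alternative to building one partial per percentile, np.percentile also accepts a sequence for q. In that case the percentile axis comes first in the result, so it has to be moved into position before concatenating. The following is a minimal sketch under that approach; the seed and the intermediate names (basic, pcts) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.uniform(0, 1, (100, 1, 200, 4))

# Mean, standard deviation and median, stacked on a new axis 2 -> (100, 1, 3, 4)
basic = np.stack([X.mean(axis=2), X.std(axis=2), np.median(X, axis=2)], axis=2)

# All three percentiles in one call; the percentile axis is first -> (3, 100, 1, 4)
pcts = np.percentile(X, [75, 85, 95], axis=2)

# Move the percentile axis to position 2 -> (100, 1, 3, 4)
pcts = np.moveaxis(pcts, 0, 2)

# Rows along axis 2: mean, std, median, p75, p85, p95 -> (100, 1, 6, 4)
stats = np.concatenate([basic, pcts], axis=2)
print(stats.shape)
```

Note that np.std defaults to ddof=0 (the population standard deviation); pass ddof=1 if the sample standard deviation is what you need.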
