Group by median for Numpy (without Pandas)

Question

Is it possible to calculate a median of one column based on groupings of another column without using pandas (and keeping my data in a Numpy array)?

For example, if this is the input:

arr = np.array([[0,1],[0,2],[0,3],[1,4],[1,5],[1,6]])

I want this as the output (using the first column to group, and then taking the median of the second column):

ans = np.array([[0,2],[1,5]])

javidcf · Accepted Answer

If you want to avoid Pandas, here is one way to do that computation. Note that, in the general case, the median is not an integer (unless you round or floor it): for even-sized groups it is the average of the two middle elements. You therefore cannot keep both the integer group id and the median in a single regular array without converting the ids to floats, although you can in a structured array.

import numpy as np

def grouped_median(group, value):
    # Sort by group, then by value within each group
    # (np.lexsort treats the last key as the primary key)
    s = np.lexsort([value, group])
    group2 = group[s]
    value2 = value[s]
    # Look for group boundaries
    w = np.flatnonzero(np.diff(group2, prepend=group2[0] - 1, append=group2[-1] + 1))
    # Size of each group
    wd = np.diff(w)
    # Mid points of each group
    m1 = w[:-1] + wd // 2
    m2 = m1 - 1 + (wd % 2)
    # Group id
    group_res = group2[m1]
    # Group median value
    value_res = (value2[m1] + value2[m2]) / 2  # Use `// 2` or round for int result
    return group_res, value_res

# Test
arr = np.array([[0, 1], [0, 2], [0, 3], [1, 4], [1, 5], [1, 6]])
group_res, value_res = grouped_median(arr[:, 0], arr[:, 1])
# Print each group id with its median
for g, v in zip(group_res, value_res):
    print(g, v)
# 0 2.0
# 1 5.0
# As a structured array
res = np.empty(group_res.shape, dtype=[('group', group_res.dtype),
                                       ('median', value_res.dtype)])
res['group'] = group_res
res['median'] = value_res
print(res)
# [(0, 2.) (1, 5.)]
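
As a cross-check, here is a simpler (but slower) sketch that calls np.median once per distinct group. It is not part of the approach above, and the name grouped_median_simple is only illustrative. The sample data is changed so that group 0 has an even number of elements, which shows the non-integer median mentioned earlier.

```python
import numpy as np

def grouped_median_simple(group, value):
    """Reference implementation: one np.median call per distinct group."""
    # np.unique returns the distinct group ids in sorted order
    ids = np.unique(group)
    # Select each group's values with a boolean mask and take their median
    medians = np.array([np.median(value[group == g]) for g in ids])
    return ids, medians

# Group 0 has two elements, so its median is the average of both
arr = np.array([[0, 1], [0, 2], [1, 4], [1, 5], [1, 6]])
ids, medians = grouped_median_simple(arr[:, 0], arr[:, 1])
print(ids)      # [0 1]
print(medians)  # [1.5 5. ]
```

This builds one boolean mask per group, so the cost grows with the number of groups times the array length. For a handful of groups that is fine; for many groups the sort-based version is likely the better choice.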
