Reputation: 442
Is it possible to calculate a median of one column based on groupings of another column without using pandas (and keeping my data in a Numpy array)?
For example, if this is the input:
arr = np.array([[0,1],[0,2],[0,3],[1,4],[1,5],[1,6]])
I want this as the output (using the first column to group, and then taking the median of the second column):
ans = np.array([[0,2],[1,5]])
Upvotes: 2
Views: 704
Reputation: 59731
If you want to avoid using Pandas for some reason, here is one way to do that computation. Note that, in the general case, the median is not an integer value (unless you round or floor it), because for even-size groups it is the average of the two middle elements. A regular array has a single dtype, so you cannot keep both an integer group id and a float median in it, although you can in a structured array.
import numpy as np

def grouped_median(group, value):
    # Sort by group first, then by value within each group
    s = np.lexsort([value, group])
    group2 = group[s]
    value2 = value[s]
    # Look for group boundaries (sentinels before and after force a boundary at both ends)
    w = np.flatnonzero(np.diff(group2, prepend=group2[0] - 1, append=group2[-1] + 1))
    # Size of each group
    wd = np.diff(w)
    # Mid points of each group (m1 == m2 for odd-size groups)
    m1 = w[:-1] + wd // 2
    m2 = m1 - 1 + (wd % 2)
    # Group id
    group_res = group2[m1]
    # Group median value
    value_res = (value2[m1] + value2[m2]) / 2  # Use `// 2` or round for int result
    return group_res, value_res

# Test
arr = np.array([[0, 1], [0, 2], [0, 3], [1, 4], [1, 5], [1, 6]])
group_res, value_res = grouped_median(arr[:, 0], arr[:, 1])
# Print
for g, v in zip(group_res, value_res):
    print(g, v)
# 0 2.0
# 1 5.0
# As a structured array
res = np.empty(group_res.shape, dtype=[('group', group_res.dtype),
                                       ('median', value_res.dtype)])
res['group'] = group_res
res['median'] = value_res
print(res)
# [(0, 2.) (1, 5.)]
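
As a cross-check, here is a minimal sketch of a simpler (but slower) alternative that calls `np.median` once per group, using `np.unique` to label the groups. The name `grouped_median_simple` is made up for illustration, and the input has an extra row `[0, 4]` added so that group 0 has an even size and the non-integer median mentioned above shows up.

```python
import numpy as np

def grouped_median_simple(group, value):
    # Hypothetical helper for illustration; assumes 1-D inputs of equal length
    group = np.asarray(group)
    value = np.asarray(value)
    # Sorted unique group ids, plus the index of each element's group in `ids`
    ids, inverse = np.unique(group, return_inverse=True)
    # One np.median call per group: easy to read, but the loop runs in Python
    medians = np.array([np.median(value[inverse == i]) for i in range(len(ids))])
    return ids, medians

# Group 0 has four values here, so its median is the mean of 2 and 3
arr = np.array([[0, 1], [0, 2], [0, 3], [0, 4], [1, 4], [1, 5], [1, 6]])
ids, medians = grouped_median_simple(arr[:, 0], arr[:, 1])
print(ids)      # [0 1]
print(medians)  # [2.5 5. ]
```

This should be fine for a small number of groups; with many groups the sort-based version above is likely to be faster because it avoids the per-group loop.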
Upvotes: 3