Abiel

Reputation: 5455

Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:

id  data
1     2
1     7
1     3
2     8
2     9
2    10
3     1
3   -10

I would like to aggregate data by grouping on id and taking either the max or the min.

In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way I can avoid Python loops and do this in a vectorized manner?
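For concreteness, here is the same example as numpy arrays, with the expected output for the max and min cases (a minimal setup sketch, not part of the original question):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# desired aggregation, one value per id in id order:
# max -> [7, 10, 1]
# min -> [2, 8, -10]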

Upvotes: 12

Views: 10781

Answers (8)

Marco Cerliani

Reputation: 22031

With only numpy and without loops:

import numpy as np
import pandas as pd

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])

# max
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])

(pd_group.values == np_group.values).all()  # True
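For the arrays above, the grouped results come out as:

print(_id)    # -> [1 2 3]
print(g_max)  # -> [7, 10, 1]
print(g_min)  # -> [2, 8, -10]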

Upvotes: 5

Eelco Hoogendoorn

Reputation: 10759

I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface, and it has a lot more functionality as well:

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)

And so on.
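Other reductions follow the same pattern (a sketch assuming the package also exposes min and mean on the group_by object, alongside max):

import numpy as np
import numpy_indexed as npi

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

g = npi.group_by(id)
group_id, group_max = g.max(data)    # -> [7, 10, 1]
group_id, group_min = g.min(data)    # -> [2, 8, -10]
group_id, group_mean = g.mean(data)  # -> [4.0, 9.0, -4.5]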

Upvotes: 3

dan-man

Reputation: 2979

I'm fairly new to Python and Numpy, but it seems like you can use the .at method of ufuncs rather than reduceat:

import numpy as np

data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))

# size by the largest id present and initialise with -inf, so that
# maximum.at gives correct results (np.empty would leave arbitrary
# values in ans that could exceed the true maxima)
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)

For example:

data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
    0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
    0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
    0.33184318,  0.10936647])
ans = array([ 0.98969952,  0.84044947,  0.63460516,  0.92042078,  0.75738113,
    0.37976055])

Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge); if they are large or sparse, you could first remap them to a dense range using np.unique(data_id), as sketched below.

I should point out that the data_id doesn't actually need to be sorted.
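A possible sketch of the large/sparse case (hypothetical data; the remapping via np.unique(..., return_inverse=True) is one reading of the hint above, not part of the original answer):

import numpy as np

data_id = np.array([10, 10, 500, 500, 500, 7])   # sparse, non-contiguous ids
data_val = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])

# remap ids to dense indices 0..n_groups-1
uniq, dense = np.unique(data_id, return_inverse=True)

ans = np.full(len(uniq), -np.inf)   # -inf so maximum.at is safe
np.maximum.at(ans, dense, data_val)
# uniq -> [7, 10, 500], ans -> [0.5, 3.0, 5.0]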

Upvotes: 4

Eelco Hoogendoorn

Reputation: 10759

A slightly faster and more general answer than the already accepted one; like joeln's answer it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only requires that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, since it never explicitly computes the max/min. The accepted solution's ability to ignore NaNs is neat; but one may also simply assign NaN values a dummy key.

import numpy as np

def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the key counts
    """
    #upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    #first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    #the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
    #first entry of each bin is a unique key
    unique_keys = key[slices]
    #reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
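Applied to the arrays from the question, the same function gives the grouped max directly (a usage sketch, assuming id and data are defined as in the question):

id = np.asarray([1, 1, 1, 2, 2, 2, 3, 3])
data = np.asarray([2, 7, 3, 8, 9, 10, 1, -10])

keys, g_max, counts = group(id, data, np.maximum)
print(keys)    # -> [1 2 3]
print(g_max)   # -> [7, 10, 1]
print(counts)  # -> [3, 3, 2]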

Upvotes: 1

joeln

Reputation: 3643

The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] = 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group:

def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
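A usage sketch on the question's arrays; the id values start at 1, so they are first remapped to 0-based group indices (the np.unique step is an addition for illustration, not part of the original answer):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# remap ids to 0..n_groups-1 as required by the functions above
_, groups = np.unique(id, return_inverse=True)

print(group_max(groups, data))  # -> [7., 10., 1.]
print(group_min(groups, data))  # -> [2., 8., -10.]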

Upvotes: 0

Bi Rico

Reputation: 25813

I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery it is most likely going to be faster than anything you can do in a Python loop.

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
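For the question's arrays, these work directly on the original id values, since the groups only need to be sortable rather than 0-based (usage sketch):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

print(group_max(id, data))  # -> [7, 10, 1]
print(group_min(id, data))  # -> [2, 8, -10]

Note that these functions return only the grouped data values, not the group keys themselves.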

Upvotes: 9

jfs

Reputation: 414235

In pure Python:

from itertools import groupby
from operator import itemgetter as ig

print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variation:

print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max()
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1     7
# 2    10
# 3     1

Upvotes: 6

mtrw

Reputation: 35088

I think this accomplishes what you're looking for:

[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

For the outer list comprehension, from right to left, set(id) collects the unique ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.

This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
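One possible form of that preprocessing (a sketch, assuming id is sorted as in the question; it builds each sub-list once with itertools.groupby and then takes the max of each):

from itertools import groupby
from operator import itemgetter

# build one sub-list of data values per id, in a single pass
sublists = {k: [v for _, v in grp] for k, grp in groupby(zip(id, data), key=itemgetter(0))}
print([max(vals) for vals in sublists.values()])  # -> [7, 10, 1]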

Upvotes: 0
