Reputation: 83
Pandas' groupby "ngroup" function tags each group in "group" (sorted key) order.
I'm looking for similar behaviour, but I need the tags assigned in original (index) order. How can I do this efficiently in pandas and numpy? This will come up often with large arrays.
> df = pd.DataFrame(
      {"A": [9, 8, 7, 8, 9]},
      index=list("abcde"))
A
a 9
b 8
c 7
d 8
e 9
> df.groupby("A").ngroup()
a 2
b 1
c 0
d 1
e 2
# LOOKING FOR ###################
a 0
b 1
c 2
d 1
e 0
And how can I achieve the desired output with a one-dimensional numpy array?
arr = np.array([9, 8, 7, 8, 9])
# looking for [0, 1, 2, 1, 0]
Upvotes: 2
Views: 4550
Reputation: 34014
I've benchmarked the suggested solutions. It turns out that:
- factorize is the fastest for array sizes > 10³,
- unique-argsort is the fastest for array sizes < 10³ (but slower by a factor of 10 for larger ones),
- ngroup is always slower, but for array sizes > 3*10³ it has roughly the same speed as factorize.
from contextlib import contextmanager
from time import perf_counter as clock
from itertools import count

import numpy as np
import pandas as pd

def f1(s):  # factorize
    return s.factorize()[0]

def f2(s):  # ngroup
    return s.groupby(s, sort=False).ngroup().values

def f3(s):  # unique-argsort
    u, idx, tags = np.unique(s.values, return_index=True, return_inverse=True)
    return idx.argsort().argsort()[tags]

@contextmanager
def bench(r):
    # time the wrapped block and append the elapsed seconds to r
    t1 = clock()
    yield
    t2 = clock()
    r.append(t2 - t1)

res = []
for i in count():  # double the array size each iteration
    n = 2**i
    a = np.random.randint(0, n, n)
    s = pd.Series(a)
    rr = []
    for j in range(5):  # five repetitions, keep the fastest of each
        r = []
        with bench(r):
            a1 = f1(s)
        with bench(r):
            a2 = f2(s)
        with bench(r):
            a3 = f3(s)
        rr.append(r)
        if max(r) > 0.5:
            break
    res.append(np.min(rr, axis=0))
    if np.max(rr) > 0.4:  # stop once the slowest timing gets too long
        break

np.save('results.npy', np.array(res))
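To inspect the saved timings afterwards, a minimal sketch (the columns correspond to f1/f2/f3 above, i.e. factorize, ngroup, unique-argsort):
import numpy as np

res = np.load('results.npy')  # shape: (number of sizes tested, 3)
for i, (t_fact, t_ngroup, t_unique) in enumerate(res):
    print(f"n=2**{i}: factorize={t_fact:.6f}s  ngroup={t_ngroup:.6f}s  unique-argsort={t_unique:.6f}s")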
Upvotes: 0
Reputation: 221644
You can use np.unique: return_index gives the first occurrence of each sorted unique value, so idx.argsort().argsort() ranks the groups by order of first appearance, and indexing that with tags relabels every element -
In [105]: a = np.array([9,8,7,8,9])
In [106]: u,idx,tags = np.unique(a, return_index=True, return_inverse=True)
In [107]: idx.argsort().argsort()[tags]
Out[107]: array([0, 1, 2, 1, 0])
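To see why the double argsort works, here is a walk-through of the intermediates for the same input (my own annotation, not part of the original answer):
import numpy as np

a = np.array([9, 8, 7, 8, 9])
u, idx, tags = np.unique(a, return_index=True, return_inverse=True)
# u    -> [7, 8, 9]        unique values, sorted
# idx  -> [2, 1, 0]        index of each value's first occurrence
# tags -> [2, 1, 0, 1, 2]  sorted-order group id per element
order = idx.argsort().argsort()
# order -> [2, 1, 0]       rank of each first occurrence, i.e. the
#                          appearance-order id of each sorted group
order[tags]
# -> array([0, 1, 2, 1, 0])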
Upvotes: 2
Reputation: 4855
You can pass sort=False
to groupby():
df.groupby('A', sort=False).ngroup()
a 0
b 1
c 2
d 1
e 0
dtype: int64
As far as I can tell, there isn't a direct equivalent of groupby in numpy. For a pure numpy version, you can use numpy.unique() to get the unique values. numpy.unique() has the option to return the inverse, basically the array of indices that would recreate your input array, but it sorts the unique values first, so the result is the same as using the regular (sorted) pandas groupby().
To get around this, capture the index of the first occurrence of each unique value, sort those indices, and use them to index the original array; that yields the unique values in their original order. Then create a dictionary mapping each unique value to its group number, and use it to convert the values in the array to group numbers.
import numpy as np
arr = np.array([9, 8, 7, 8, 9])
_, i = np.unique(arr, return_index=True) # indices of the first occurrence of each unique value
groups = arr[np.sort(i)] # sort those indices to get the unique values in original order
m = {value:ngroup for ngroup, value in enumerate(groups)} # create a mapping of value:groupnumber
np.vectorize(m.get)(arr) # use vectorize to create a new array using m
array([0, 1, 2, 1, 0])
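If you want to avoid the Python-level lookups that np.vectorize(m.get) performs, a searchsorted-based variant (my own sketch, not part of the original answer) gives the same result while staying in compiled numpy code:
import numpy as np

arr = np.array([9, 8, 7, 8, 9])
_, i = np.unique(arr, return_index=True)
groups = arr[np.sort(i)]                   # unique values in first-appearance order
order = np.argsort(groups)                 # permutation that sorts groups
pos = np.searchsorted(groups[order], arr)  # each element's rank among the sorted uniques
order[pos]                                 # -> array([0, 1, 2, 1, 0])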
Upvotes: 1
Reputation: 150785
Perhaps a better way is factorize:
df['A'].factorize()[0]
Output:
array([0, 1, 2, 1, 0])
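If pandas is available, the top-level pd.factorize also accepts a plain numpy array directly, which covers the numpy half of the question:
import numpy as np
import pandas as pd

arr = np.array([9, 8, 7, 8, 9])
pd.factorize(arr)[0]  # -> array([0, 1, 2, 1, 0])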
Upvotes: 2