Amir

Reputation: 83

Pandas groupby.ngroup() in index order?

Pandas groupby "ngroup" function tags each group in "group" order.

I'm looking for similar behaviour, but with the tags assigned in original (index) order. How can I do this efficiently (it will happen often with large arrays) in pandas and numpy?

> df = pd.DataFrame(
          {"A": [9,8,7,8,9]},
          index=list("abcde"))
   A
a  9
b  8
c  7
d  8
e  9
> df.groupby("A").ngroup()
a    2
b    1
c    0
d    1
e    2


# LOOKING FOR ###################
a    0
b    1
c    2
d    1
e    0

How can I achieve the desired output with a single dimension numpy array?

arr = np.array([9, 8, 7, 8, 9])
# looking for [0,1,2,1,0]

Upvotes: 2

Views: 4550

Answers (4)

Antony Hatchkins

Reputation: 34014

I've benchmarked the suggested solutions:

[benchmark plot: runtime vs. array size for factorize, ngroup and unique-argsort]

Turns out that:

- factorize is the fastest for array sizes > 10³,
- unique-argsort is the fastest for array sizes < 10³ (but slower by a factor of 10 for larger ones),
- ngroup is always slower, but for array sizes > 3×10³ it has roughly the same speed as factorize.

from contextlib import contextmanager
from time import perf_counter as clock
from itertools import count
import numpy as np
import pandas as pd

def f1(s):
    return s.factorize()[0]

def f2(s):
    return s.groupby(s, sort=False).ngroup().values

def f3(s):
    u, idx, tags = np.unique(s.values, return_index=True, return_inverse=True)
    return idx.argsort().argsort()[tags]

@contextmanager
def bench(r):
    t1 = clock()
    yield
    t2 = clock()
    r.append(t2-t1)

res = []
for i in count():
    n = 2**i
    a = np.random.randint(0, n, n)
    s = pd.Series(a)
    rr = []
    for j in range(5):
        r = []
        with bench(r):
            a1 = f1(s)
        with bench(r):
            a2 = f2(s)
        with bench(r):
            a3 = f3(s)
        rr.append(r)
        if max(r) > 0.5:
            break
    res.append(np.min(rr, axis=0))
    if np.max(rr) > 0.4:
        break

np.save('results.npy', np.array(res))

Upvotes: 0

Divakar

Reputation: 221644

You can use np.unique -

In [105]: a = np.array([9,8,7,8,9])

In [106]: u,idx,tags = np.unique(a, return_index=True, return_inverse=True)

In [107]: idx.argsort().argsort()[tags]
Out[107]: array([0, 1, 2, 1, 0])
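A worked trace of the intermediate arrays for this example may clarify why the double argsort works: the first argsort orders the unique values by first occurrence, and the second converts those positions into ranks.

```python
import numpy as np

a = np.array([9, 8, 7, 8, 9])

# np.unique returns the sorted unique values, the index of each value's
# first occurrence, and the inverse mapping back onto the sorted order.
u, idx, tags = np.unique(a, return_index=True, return_inverse=True)
# u    -> [7, 8, 9]        sorted unique values
# idx  -> [2, 1, 0]        first occurrence of 7, 8, 9 in a
# tags -> [2, 1, 0, 1, 2]  position of each element of a within u

# Double argsort turns the first-occurrence indices into ranks:
# ranks[k] is the appearance order of u[k] in the original array.
ranks = idx.argsort().argsort()
# ranks -> [2, 1, 0]  (9 appeared first, then 8, then 7)

out = ranks[tags]
# out -> [0, 1, 2, 1, 0]
```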

Upvotes: 2

Craig

Reputation: 4855

You can pass sort=False to groupby():

df.groupby('A', sort=False).ngroup()

a    0
b    1
c    2
d    1
e    0
dtype: int64

As far as I can tell, there isn't a direct equivalent of groupby in numpy. For a pure numpy version, you can use numpy.unique() to get the unique values. numpy.unique() can also return the inverse, i.e. the array of indices that would recreate your input array, but it sorts the unique values first, so the result is the same as the regular (sorted) pandas groupby.

To get around this, you can capture the index values of the first occurrence of each unique value. Sort the index values and use these as indices into the original array to get the unique values in their original order. Create a dictionary to map between the unique values and the group numbers and then use that dictionary to convert the values in the array to the appropriate group numbers.

import numpy as np

arr = np.array([9, 8, 7, 8, 9])

_, i = np.unique(arr, return_index=True)  # get the indexes of the first occurrence of each unique value
groups = arr[np.sort(i)]  # sort the indexes and retrieve the values from the array so that they are in the array order
m = {value:ngroup for ngroup, value in enumerate(groups)}  # create a mapping of value:groupnumber
np.vectorize(m.get)(arr)  # use vectorize to create a new array using m

array([0, 1, 2, 1, 0])

Upvotes: 1

Quang Hoang

Reputation: 150785

Perhaps a better way is factorize:

df['A'].factorize()[0]

Output:

array([0, 1, 2, 1, 0])
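Note that the top-level pd.factorize also accepts a plain 1-D numpy array, which covers the numpy-array part of the question; it labels values in order of first appearance by default:

```python
import numpy as np
import pandas as pd

arr = np.array([9, 8, 7, 8, 9])

# codes are group numbers in first-appearance order;
# uniques are the distinct values in that same order.
codes, uniques = pd.factorize(arr)
# codes   -> [0, 1, 2, 1, 0]
# uniques -> [9, 8, 7]
```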

Upvotes: 2
