qfd

Reputation: 788

enumerate groups in a dataframe

i have the following table

date        ui  mw  maxw    tC  HL    msurp
01/03/2004  A   10   10     eC  0.25   0.1
01/04/2004  A   10   10     eC  0.25   -0.1
01/03/2004  B   20   20     bC  0.5    0.3
01/03/2004  B   20   20     bC  0.25    0.3

What I am looking to do is add a column to this table that enumerates the unique combinations of ui, mw, maxw, tC and HL.

So, for example, the unique combinations of ui, mw, maxw, tC and HL in the above table are

 A, 10, 10, eC, 0.25
 B, 20, 20, bC, 0.5
 B, 20, 20, bC, 0.25

There are 3 in total, so the output should be something like

date        ui  mw  maxw    tC  HL    msurp  counter
01/03/2004  A   10   10     eC  0.25   0.1    1
01/04/2004  A   10   10     eC  0.25   -0.1   1
01/03/2004  B   20   20     bC  0.5    0.3    2
01/03/2004  B   20   20     bC  0.25    0.3   3
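
For anyone who wants to reproduce this, a minimal sketch of the input table as a pandas DataFrame (named df, which is what the answers below assume):

import pandas as pd

df = pd.DataFrame({
    'date': ['01/03/2004', '01/04/2004', '01/03/2004', '01/03/2004'],
    'ui': ['A', 'A', 'B', 'B'],
    'mw': [10, 10, 20, 20],
    'maxw': [10, 10, 20, 20],
    'tC': ['eC', 'eC', 'bC', 'bC'],
    'HL': [0.25, 0.25, 0.5, 0.25],
    'msurp': [0.1, -0.1, 0.3, 0.3],
})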

Upvotes: 7

Views: 3792

Answers (2)

piRSquared

Reputation: 294218

Option 1
pd.Series.factorize

df.assign(
    counter=df[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, axis=1).factorize()[0] + 1)

         date ui  mw  maxw  tC    HL  msurp  counter
0  01/03/2004  A  10    10  eC  0.25    0.1        1
1  01/04/2004  A  10    10  eC  0.25   -0.1        1
2  01/03/2004  B  20    20  bC  0.50    0.3        2
3  01/03/2004  B  20    20  bC  0.25    0.3        3
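
To spell out what Option 1 does: factorize works on a one-dimensional sequence, so the key columns are first collapsed into one tuple per row, and the integer codes it returns (shifted to start at 1) become the counter. A step-by-step sketch of the same idea:

# collapse the key columns into one tuple per row
row_keys = df[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, axis=1)

# factorize returns (codes, uniques); codes number each key in order of appearance
codes, uniques = row_keys.factorize()
df['counter'] = codes + 1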

Option 1.5
A more obnoxious version of Option 1, but it should be faster.

df.assign(
    counter=pd.factorize(list(zip(
        *[df[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']]
    )))[0] + 1
)

         date ui  mw  maxw  tC    HL  msurp  counter
0  01/03/2004  A  10    10  eC  0.25    0.1        1
1  01/04/2004  A  10    10  eC  0.25   -0.1        1
2  01/03/2004  B  20    20  bC  0.50    0.3        2
3  01/03/2004  B  20    20  bC  0.25    0.3        3
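
Presumably the speed-up over Option 1 comes from building the row keys with a plain zip over Python lists rather than DataFrame.apply, which avoids pandas' per-row apply overhead; the factorize step itself is unchanged.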

Option 2
@ayhan's answer (will delete if he posts it)

df.assign(
    counter=df.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)

         date ui  mw  maxw  tC    HL  msurp  counter
0  01/03/2004  A  10    10  eC  0.25    0.1        1
1  01/04/2004  A  10    10  eC  0.25   -0.1        1
2  01/03/2004  B  20    20  bC  0.50    0.3        3
3  01/03/2004  B  20    20  bC  0.25    0.3        2
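
Note that ngroup numbers the groups in the groupby's sort order, which is why the last two rows get 3 and 2 here (the key with HL 0.25 sorts before the one with 0.5). If numbering by order of appearance is wanted instead, matching Options 1 and 1.5, passing sort=False to groupby should do it:

df.assign(
    counter=df.groupby(['ui', 'mw', 'maxw', 'tC', 'HL'], sort=False).ngroup() + 1)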

Timing
The full benchmark code is below; this table shows each method's time relative to the fastest for each size:

(lambda r: r.div(r.min(1), 0).assign(best=lambda x: x.idxmin(1)))(results)

             pir1      pir2     ayhan   best
100     17.260639  1.000000  3.438354   pir2
300     30.550010  1.000000  2.598456   pir2
1000    43.201163  1.000000  1.236190   pir2
3000    61.593932  1.000000  1.025420   pir2
10000  127.003138  2.177171  1.000000  ayhan

[log-log plot of the timing results, produced by results.plot(loglog=True) below]

from timeit import timeit

pir1 = lambda d: d.assign(counter=d[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, axis=1).factorize()[0] + 1)
pir2 = lambda d: d.assign(counter=pd.factorize(list(zip(*[d[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']])))[0] + 1)
ayhan = lambda d: d.assign(counter=d.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 ayhan'.split(),
    dtype=float
)

for i in results.index:
    # build a larger test frame by repeating the sample frame i times
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # record the timing for method j at size i
        results.at[i, j] = timeit(stmt, setp, number=10)

results.plot(loglog=True)

Upvotes: 11

BENY

Reputation: 323226

Like ayhan's answer, this assumes the order of the numbering is not important:

df[['ui','mw','maxw','tC','HL']].T.apply(lambda x: ','.join(x.astype(str))).astype('category').cat.codes


Out[1247]: 
0    0
1    0
2    2
3    1
dtype: int8

As you said, you can then aggregate by this instead of specifying groupby(['ui', 'mw', 'maxw', ...]):

Just do this, and then groupby('counter'):

df['counter'] = df[['ui','mw','maxw','tC','HL']].T.apply(lambda x: ','.join(x.astype(str)))
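
For example, with that column in place (msurp and sum() are chosen here only for illustration), aggregating per group is just:

df.groupby('counter')['msurp'].sum()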

Upvotes: 2
