Reputation: 788
i have the following table
date ui mw maxw tC HL msurp
01/03/2004 A 10 10 eC 0.25 0.1
01/04/2004 A 10 10 eC 0.25 -0.1
01/03/2004 B 20 20 bC 0.5 0.3
01/03/2004 B 20 20 bC 0.25 0.3
what i am looking to do is add a column to this table that basically enumerates the unique combinations of ui, mw, maxw, tC and HL and enumerates
so for example in the above table
unique combinations of ui, mw, maxw, tC and HL are
A,10, 10, eC, 0.25
B,20, 20, bC, 0.5
B,20, 20, bC, 0.5
There are total 3 so the output should be something like
date ui mw maxw tC HL msurp counter
01/03/2004 A 10 10 eC 0.25 0.1 1
01/04/2004 A 10 10 eC 0.25 -0.1 1
01/03/2004 B 20 20 bC 0.5 0.3 2
01/03/2004 B 20 20 bC 0.25 0.3 3
Upvotes: 7
Views: 3792
Reputation: 294218
Option 1
pd.Series.factorize
df.assign(
counter=df[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, 1).factorize()[0] + 1)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 2
3 01/03/2004 B 20 20 bC 0.25 0.3 3
Option 1.5
More obnoxious version of option 1 but should be faster
df.assign(
counter=pd.factorize(list(zip(
*[df[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']]
)))[0] + 1
)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 2
3 01/03/2004 B 20 20 bC 0.25 0.3 3
Option 2
@ayhan's answer (will delete if he posts it)
df.assign(
counter=df.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)
date ui mw maxw tC HL msurp counter
0 01/03/2004 A 10 10 eC 0.25 0.1 1
1 01/04/2004 A 10 10 eC 0.25 -0.1 1
2 01/03/2004 B 20 20 bC 0.50 0.3 3
3 01/03/2004 B 20 20 bC 0.25 0.3 2
Timing
Code Below
(lambda r: r.div(r.min(1), 0).assign(best=lambda x: x.idxmin(1)))(results)
pir1 pir2 ayhan best
100 17.260639 1.000000 3.438354 pir2
300 30.550010 1.000000 2.598456 pir2
1000 43.201163 1.000000 1.236190 pir2
3000 61.593932 1.000000 1.025420 pir2
10000 127.003138 2.177171 1.000000 ayhan
pir1 = lambda d: d.assign(counter=d[['ui', 'mw', 'maxw', 'tC', 'HL']].apply(tuple, 1).factorize()[0] + 1)
pir2 = lambda d: d.assign(counter=pd.factorize(list(zip(*[d[c].values.tolist() for c in ['ui', 'mw', 'maxw', 'tC', 'HL']])))[0] + 1)
ayhan = lambda d: d.assign(counter=d.groupby(['ui', 'mw', 'maxw', 'tC', 'HL']).ngroup() + 1)
results = pd.DataFrame(
index=[100, 300, 1000, 3000, 10000],
columns='pir1 pir2 ayhan'.split(),
dtype=float
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.set_value(i, j, timeit(stmt, setp, number=10))
results.plot(loglog=True)
Upvotes: 11
Reputation: 323226
Like ayhan's answer , assume order is not important
df[['ui','mw','maxw','tC','HL']].T.apply(lambda x : ','.join(x.astype(str))).astype('category').cat.codes
Out[1247]:
0 0
1 0
2 2
3 1
dtype: int8
as you said i can then aggregate by this instead of specifying group by [ui, mw, maxw etc
just do this , and groupby('counter')
df['counter']=df[['ui','mw','maxw','tC','HL']].T.apply(lambda x : ','.join(x.astype(str)))
Upvotes: 2