Reputation: 3722
Let's say I have a dataframe:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'b': [2016, 2017, 2018, 2019, 2000, 2000, 2000, 2000, 2007, 2008, 2014]})
I want to group this data by column a and, for each row, count how many years within its group are smaller than that row's year.
For example, within group 1, 2016 is the 0th year and 2017 is the 1st year (it has 1 year behind it).
I tried doing:
df['c'] = df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)
but this is taking a really long time. I was wondering if there is a better way to do it. I'm working with 6.5M rows.
Expected output:
a b c
0 1 2016 0
1 1 2017 1
2 1 2018 2
3 1 2019 3
4 2 2000 0
5 2 2000 0
6 2 2000 0
7 2 2000 0
8 3 2007 0
9 3 2008 1
10 3 2014 2
Upvotes: 1
Views: 156
Reputation: 792
Clean and efficient solution:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'b': [2016, 2017, 2018, 2019, 2000, 2000, 2000, 2000, 2007, 2008, 2014]})

df = df.set_index('a')
df = df.sort_index()

# Build a nested lookup: for each group in 'a', map each unique year to its
# position among the group's unique years (0, 1, 2, ...). Two caveats:
# pd.unique preserves order of appearance, so the years must already be
# sorted within each group, and ties collapse into a single step (a dense
# rank), which can differ from "count of smaller years" when a group mixes
# duplicates with larger values. Both assumptions hold for the sample data.
quickmap = {}
for index in df.index.unique():
    temphash = {}
    val = 0
    for i in df.loc[index]['b'].unique():
        temphash[i] = val
        val += 1
    quickmap[index] = temphash

df = df.reset_index()

# Per-row lookup of the precomputed value.
def toret(row):
    key = row['a']
    subkey = row['b']
    return quickmap[key][subkey]

df['c'] = df.apply(toret, axis=1)
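The idea is to pay the grouping cost once up front and reduce the per-row work to a dict lookup. As a quick sanity check (a small addition on my part; expected is just the question's c column retyped):

expected = [0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2]
assert df['c'].tolist() == expected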
Upvotes: 0
Reputation: 323236
A numpy solution:
import numpy as np
from scipy.stats import rankdata

np.concatenate([rankdata(x, method='min')
                for x in np.split(df.b.values, np.flatnonzero(df.a.diff().fillna(0)))]) - 1
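To unpack the one-liner, here is the same idea step by step (a sketch with my own variable names, assuming the frame is sorted by a as in the question):

import numpy as np
from scipy.stats import rankdata

# Row positions where 'a' changes value, i.e. the group boundaries.
breaks = np.flatnonzero(df.a.diff().fillna(0))

# Split 'b' into one array per group, min-rank within each group, then
# stitch the pieces back together and shift to 0-based counts.
groups = np.split(df.b.values, breaks)
df['c'] = (np.concatenate([rankdata(g, method='min') for g in groups]) - 1).astype(int)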
%timeit df.groupby(['a'])['b'].rank('min').sub(1)
1000 loops, best of 3: 845 µs per loop
%timeit df.groupby('a')['b'].transform(lambda x: pd.factorize(x)[0])
100 loops, best of 3: 1.77 ms per loop
%timeit df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)
100 loops, best of 3: 2.71 ms per loop
%timeit np.concatenate([rankdata(x, method='min') for x in np.split(df.b.values, np.flatnonzero(df.a.diff().fillna(0)))]) - 1
1000 loops, best of 3: 342 µs per loop
Upvotes: 1
Reputation: 3353
I would use rank and afterwards sub(1), which I think is very readable, and it is approximately twice as fast as the other answer and ~3.5x as fast as the original approach:
df.groupby(['a'])['b'].rank('min').sub(1)
#0 0.0
#1 1.0
#2 2.0
#3 3.0
#4 0.0
#5 0.0
#6 0.0
#7 0.0
#8 0.0
#9 1.0
#10 2.0
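Note that rank returns floats; to get the integer column shown in the expected output, cast at the end (my addition, not part of the original answer):

df['c'] = df.groupby(['a'])['b'].rank('min').sub(1).astype(int)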
%timeit df.groupby(['a'])['b'].rank('min').sub(1)
#1.58 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')['b'].transform(lambda x: pd.factorize(x)[0])
#3.76 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)
#5.32 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1