Reputation: 362
I have a dataframe with non-unique indices on purpose. I want to perform an operation on the rows that share the same key, as shown below: for each unique key, add the first 'other_numbers' value of the group to every 'numbers' value in that group. Is this possible without splitting the dataframe or other time-consuming operations?
import pandas as pd
d = {'key': ['a', 'a', 'b', 'b'],
     'numbers': [10, 20, 30, 40],
     'other_numbers': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df = df.set_index('key')
print(df)
## desired result (the 'new' column is what should be computed):
##      numbers  other_numbers  new
## key
## a         10              1   11
## a         20              2   21
## b         30              3   33
## b         40              4   43
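For reference, the straightforward split-and-recombine approach the question hopes to avoid might look like this (a sketch, not from the original post; result is a hypothetical name):
# Naive baseline: split by key, add each group's first 'other_numbers'
# to every 'numbers' value in the group, then concatenate the pieces.
parts = []
for key, group in df.groupby(level=0):
    group = group.copy()
    group['new'] = group['numbers'] + group['other_numbers'].iloc[0]
    parts.append(group)
result = pd.concat(parts)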
Upvotes: 1
Views: 129
Reputation: 863166
You can use duplicated to flag every value after the first for each duplicated index, use that as a mask on the other_numbers column so the duplicates become NaN, and then replace the NaNs by forward filling with ffill (i.e. fillna with method='ffill'):
df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int)
print(df)
     numbers  other_numbers  new
key
a         10              1   11
a         20              2   21
b         30              3   33
b         40              4   43
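To see what each step of the chain produces on the sample data, here it is broken into intermediates (a sketch for illustration; the names dup, masked, and filled are mine):
dup = df.index.duplicated()             # array([False,  True, False,  True])
masked = df['other_numbers'].mask(dup)  # 1.0, NaN, 3.0, NaN - duplicates become NaN
filled = masked.ffill()                 # 1.0, 1.0, 3.0, 3.0 - first value per key propagated
df['new'] = df['numbers'] + filled.astype(int)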
Timings:
import numpy as np
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'numbers': np.random.randint(20, size=N),
                   'other_numbers': np.random.randint(10, size=N)},
                  index=np.random.randint(20000, size=N)).sort_index()
df.index.name = 'key'
print(df)
In [83]: %timeit df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int)
10 loops, best of 3: 34.8 ms per loop
In [84]: %timeit df.assign(new1=df.groupby('key')['other_numbers'].transform('first')+df['numbers'])
10 loops, best of 3: 64.7 ms per loop
Upvotes: 1
Reputation: 210882
One way would be:
In [28]: df.assign(new=df.groupby('key')['other_numbers'].transform('first')+df['numbers'])
Out[28]:
     numbers  other_numbers  new
key
a         10              1   11
a         20              2   21
b         30              3   33
b         40              4   43
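For illustration, the intermediate transform('first') broadcasts each group's first other_numbers value back to the original shape, so the addition aligns row by row (a sketch; firsts is a hypothetical name):
firsts = df.groupby('key')['other_numbers'].transform('first')
# key
# a    1
# a    1
# b    3
# b    3
# Name: other_numbers, dtype: int64
df['new'] = df['numbers'] + firsts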
Upvotes: 1