Reputation: 362
I have a dataframe with non-unique indices on purpose. I want to perform an operation on the rows that share the same key, as shown below: for each unique key, add the first 'other_numbers' value of the group to every 'numbers' value in that group. Is this possible without splitting the dataframe or other time-consuming operations?
import pandas as pd
d = {'key': ['a', 'a', 'b', 'b'],
     'numbers': [10, 20, 30, 40],
     'other_numbers': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df = df.set_index('key')
print(df)
## desired result (the 'new' column is what should be computed):
##      numbers  other_numbers  new
## key
## a         10              1   11
## a         20              2   21
## b         30              3   33
## b         40              4   43
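For reference, the straightforward split-and-recombine approach the question hopes to avoid might look like this (a sketch, not from the original post; result is a hypothetical name):
# Naive baseline: split by key, add each group's first 'other_numbers'
# to every 'numbers' value in the group, then concatenate the pieces.
parts = []
for key, group in df.groupby(level=0):
    group = group.copy()
    group['new'] = group['numbers'] + group['other_numbers'].iloc[0]
    parts.append(group)
result = pd.concat(parts)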
Upvotes: 1
Views: 129
Reputation: 863166
You can use duplicated to flag every value after the first for each duplicated index, use that as a mask on the other_numbers column so the duplicates become NaN, and then replace the NaNs by forward filling with ffill (i.e. fillna with method='ffill'):
df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int)
print(df)
     numbers  other_numbers  new
key
a         10              1   11
a         20              2   21
b         30              3   33
b         40              4   43
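To see what each step of the chain produces on the sample data, here it is broken into intermediates (a sketch for illustration; the names dup, masked, and filled are mine):
dup = df.index.duplicated()             # array([False,  True, False,  True])
masked = df['other_numbers'].mask(dup)  # 1.0, NaN, 3.0, NaN - duplicates become NaN
filled = masked.ffill()                 # 1.0, 1.0, 3.0, 3.0 - first value per key propagated
df['new'] = df['numbers'] + filled.astype(int)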
Timings:
import numpy as np
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'numbers': np.random.randint(20, size=N),
                   'other_numbers': np.random.randint(10, size=N)},
                  index=np.random.randint(20000, size=N)).sort_index()
df.index.name = 'key'
print(df)
In [83]: %timeit df['new'] = df['numbers'] + df['other_numbers'].mask(df.index.duplicated()).ffill().astype(int)
10 loops, best of 3: 34.8 ms per loop
In [84]: %timeit df.assign(new1=df.groupby('key')['other_numbers'].transform('first')+df['numbers'])
10 loops, best of 3: 64.7 ms per loop
Upvotes: 1
Reputation: 210882
One way would be:
In [28]: df.assign(new=df.groupby('key')['other_numbers'].transform('first')+df['numbers'])
Out[28]:
     numbers  other_numbers  new
key
a         10              1   11
a         20              2   21
b         30              3   33
b         40              4   43
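For illustration, the intermediate transform('first') broadcasts each group's first other_numbers value back to the original shape, so the addition aligns row by row (a sketch; firsts is a hypothetical name):
firsts = df.groupby('key')['other_numbers'].transform('first')
# key
# a    1
# a    1
# b    3
# b    3
# Name: other_numbers, dtype: int64
df['new'] = df['numbers'] + firsts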
Upvotes: 1