Reputation: 651
I have two pandas DataFrames like this:
category   1   2   3
b         15  35  20
d         40  35  15

category  total
a            10
b            10
c            10
d            10
e            10
f            10
In the second DataFrame the categories are unique, so there is only one row per category. In the first DataFrame a category can appear more than once.
I would like to add the element in column '2' of the first DataFrame to the corresponding element in the second DataFrame; the element in column '1' should be added to the cell above, and the one in column '3' to the cell below.
Rendering this result:
category  total
a         10 + 15
b         10 + 35
c         10 + 20 + 40
d         10 + 35
e         10 + 15
f         10
Is there a good way to do this using pandas? I have a very large dataset, so it is important to me that the approach I choose is fast and doesn't use too much memory. Would it be better if I did not use pandas and used NumPy instead?
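For concreteness, the two frames above can be rebuilt with something like the following sketch (the names df1 and df2, the integer column labels, and the use of 'category' as the index are assumptions based on the display):

import pandas as pd

# First frame: categories may repeat; columns 1, 2 and 3 hold the values to spread out
df1 = pd.DataFrame({1: [15, 40], 2: [35, 35], 3: [20, 15]},
                   index=pd.Index(['b', 'd'], name='category'))

# Second frame: exactly one row per category, with a running total
df2 = pd.DataFrame({'total': [10] * 6},
                   index=pd.Index(list('abcdef'), name='category'))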
Upvotes: 3
Views: 113
Reputation: 1790
I found it hard to do this in a completely vectorized way: the thing is that the categories targeted in df2 can sit at contiguous indices, so the three-row windows would overlap (for instance, if both 'b' and 'c' appeared in df1, their windows would both cover rows 'b' and 'c'), making a loop (for me) necessary.
How I created the data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[[15, 35, 20],
                         [40, 35, 15]],
                   columns=[1, 2, 3],
                   index=['b', 'd'])
df2 = pd.DataFrame({'category': list('abcdef'),
                    'total': [10] * 6})
df2 = df2.set_index('category')
And then the processing part: the accu array accumulates all the values that we will later add to the total column.
accu = np.zeros_like(df2['total'].values.ravel())
for cat in df1.index.unique():
    idx = df2.index.get_loc(cat)
    # df1.loc[[cat]] keeps a 2-D frame even when cat occurs only once, so the
    # column-wise sum is always a length-3 vector: (above, same row, below)
    accu[max(idx - 1, 0) : idx + 2] += np.sum(df1.loc[[cat]].values, axis=0)
df2['total'] += accu
It could surely be made faster with NumPy broadcasting and smart-indexing functionality, but at the cost of memory efficiency in my opinion. Just tell me if this solution is not fast enough for you as is.
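For what it's worth, here is a minimal, unbenchmarked sketch of what such a vectorized variant could look like with np.add.at (which accumulates unbuffered, so overlapping windows and repeated categories are handled correctly); it assumes every category of df1 is also present in df2 and masks out contributions that would fall outside the frame:

import numpy as np

accu = np.zeros(len(df2), dtype=df2['total'].dtype)
pos = df2.index.get_indexer(df1.index)   # row position in df2 of each df1 row

# column 2 lands on the category's own row
np.add.at(accu, pos, df1[2].values)
# column 1 lands one row above, column 3 one row below (edges masked out)
above, below = pos - 1, pos + 1
np.add.at(accu, above[above >= 0], df1[1].values[above >= 0])
np.add.at(accu, below[below < len(df2)], df1[3].values[below < len(df2)])

df2['total'] += accu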
Upvotes: 1
Reputation: 5706
Somewhat long but doesn't require loops:
first = pd.DataFrame([['b', 15, 35, 20], ['d', 40, 35, 15]],
                     columns=['Category', '1', '2', '3'])
# Category 1 2 3
# 0 b 15 35 20
# 1 d 40 35 15
second = pd.DataFrame([['a', 10], ['b', 10], ['c', 10], ['d', 10], ['e', 10], ['f', 10]],
                      columns=['Category', 'total'])
# Category total
# 0 a 10
# 1 b 10
# 2 c 10
# 3 d 10
# 4 e 10
# 5 f 10
There's a mapping for the above and below values. If you think about it, the mapping of second['Category'] to first['Category'] can be summarized as below (second['Category'] should have distinct values):
second['above_mapped'] = second['Category'].shift(-1)
second['below_mapped'] = second['Category'].shift(1)
print(second)
  Category  total above_mapped below_mapped
0        a     10            b          NaN
1        b     10            c            a
2        c     10            d            b
3        d     10            e            c
4        e     10            f            d
5        f     10          NaN            e
Using the above mapping table, I can create two dictionaries to define the mapping:
above_map = pd.Series(second['above_mapped'].values, index=second['Category']).to_dict()
# {'a': 'b', 'b': 'c', 'c': 'd', 'd': 'e', 'e': 'f', 'f': nan}
below_map = pd.Series(second['below_mapped'].values, index=second['Category']).to_dict()
# {'a': nan, 'b': 'a', 'c': 'b', 'd': 'c', 'e': 'd', 'f': 'e'}
I can create the mappings to the actual values to be added by just using the first dataframe:
above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
# {'b': 15, 'd': 40}
# middle values just correspond to the 'Category' column of first
middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
# {'b': 35, 'd': 35}
below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()
# {'b': 20, 'd': 15}
Now, mapping the numeric values onto second using the three dictionaries above:
second['above_value_mapped'] = second['above_mapped'].map(above_values)
second['middle_value_mapped'] = second['Category'].map(middle_values)
second['below_value_mapped'] = second['below_mapped'].map(below_values)
print(second)
  Category  total above_mapped below_mapped  above_value_mapped  \
0        a     10            b          NaN                15.0
1        b     10            c            a                 NaN
2        c     10            d            b                40.0
3        d     10            e            c                 NaN
4        e     10            f            d                 NaN
5        f     10          NaN            e                 NaN

   middle_value_mapped  below_value_mapped
0                  NaN                 NaN
1                 35.0                 NaN
2                  NaN                20.0
3                 35.0                 NaN
4                  NaN                15.0
5                  NaN                 NaN
Summing up the second['total'], second['above_value_mapped'], second['middle_value_mapped'], and second['below_value_mapped'] columns gives the desired total value:
second.fillna(0, inplace=True)
second['new_total'] = (second['total'] + second['above_value_mapped']
                       + second['middle_value_mapped'] + second['below_value_mapped'])
second[['Category', 'new_total']]
  Category  new_total
0        a       25.0
1        b       45.0
2        c       70.0
3        d       45.0
4        e       25.0
5        f       10.0
# Test dataframes
a = np.unique(pd.util.testing.rands_array(4, 10000))  # length-4 distinct strings as categories
# a.shape ~ 10000
df1 = pd.DataFrame(data=np.random.randint(1, 50, (a.shape[0], 3)),
                   columns=['1', '2', '3'],
                   index=a)
df2 = pd.DataFrame({'Category': a,
                    'total': [10] * a.shape[0]})
def akilat90(first, second):
    second['above_mapped'] = second['Category'].shift(-1)
    second['below_mapped'] = second['Category'].shift(1)
    above_map = pd.Series(second['Category'].shift(-1).values, index=second['Category']).to_dict()
    below_map = pd.Series(second['Category'].shift(1).values, index=second['Category']).to_dict()
    above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
    middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
    below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()
    second['above_value_mapped'] = second['above_mapped'].map(above_values)
    second['middle_value_mapped'] = second['Category'].map(middle_values)
    second['below_value_mapped'] = second['below_mapped'].map(below_values)
    second.fillna(0, inplace=True)
    second['new_total'] = (second['total'] + second['above_value_mapped']
                           + second['middle_value_mapped'] + second['below_value_mapped'])
    return second[['Category', 'new_total']]
def Jacquot(df1, df2):
    df2 = df2.set_index('Category')
    accu = np.zeros_like(df2['total'].values.ravel())
    for cat in df1.index.unique():
        idxs = np.where(df2.index == cat)[0]
        for idx in idxs:
            accu[max(idx - 1, 0) : (idx + 2)] += df1.loc[cat].values
    df2['total'] += accu
    return df2
%%timeit
akilat90(df1.reset_index().rename(columns = {'index':'Category'}), df2)
# 33.1 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
Jacquot(df1, df2)
# <Please check>
Upvotes: 1