dagrun

Reputation: 651

Adding to window in pandas

I have two pandas DataFrames like this:

category     1     2     3

b           15    35    20
d           40    35    15

category    total

a           10
b           10
c           10
d           10
e           10
f           10

In the second DataFrame the categories are unique: there is exactly one row per category. In the first DataFrame a category can appear more than once.

For each row of the first DataFrame, I would like to add the value in column '2' to the row of the matching category in the second DataFrame, the value in column '1' to the row above it, and the value in column '3' to the row below it.

Rendering this result:

category    total

a           10 + 15
b           10 + 35
c           10 + 20 + 40
d           10 + 35
e           10 + 15
f           10

Is there a good way to do this using pandas? I have a very large dataset, so it is important to me that the approach I choose is fast and doesn't use too much memory. Would it be better to skip pandas and use NumPy instead?
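To make the operation concrete, here is a rough sketch of what I mean in NumPy terms (variable names are just illustrative; `np.add.at` is used because it accumulates correctly even when target positions repeat):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[15, 35, 20], [40, 35, 15]],
                   columns=[1, 2, 3], index=['b', 'd'])
df2 = pd.DataFrame({'category': list('abcdef'),
                    'total': [10] * 6}).set_index('category')

total = df2['total'].to_numpy().astype(float)
pos = df2.index.get_indexer(df1.index)          # row position of each df1 category
for offset, col in zip((-1, 0, 1), (1, 2, 3)):  # above / same row / below
    tgt = pos + offset
    mask = (tgt >= 0) & (tgt < len(total))      # drop windows that fall off the edge
    np.add.at(total, tgt[mask], df1[col].to_numpy()[mask])
df2['total'] = total                            # a=25, b=45, c=70, d=45, e=25, f=10
```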

Upvotes: 3

Views: 113

Answers (2)

Louis R

Reputation: 1790

I found it hard to do this in a completely vectorized way: the same category can appear at contiguous indices, so the windows would overlap, which (for me) makes a loop necessary.

How I created the data:

df1 = pd.DataFrame(data=[[15, 35, 20], 
                         [40, 35, 15]], 
                   columns=[1, 2, 3], 
                   index=['b', 'd'])
df2 = pd.DataFrame({'category': list('abcdef'), 
                    'total': [10] * 6})
df2 = df2.set_index('category')

And then the processing part: the accu array accumulates all the values that will later be added to the total column.

accu = np.zeros_like(df2['total'].values)

for cat in df1.index.unique():
    idx = df2.index.get_loc(cat)
    # df1.loc[[cat]] stays 2-D even when the category appears only once,
    # so the per-column sum is always a length-3 vector
    vals = df1.loc[[cat]].values.sum(axis=0)
    # clip the window at both ends and trim vals to match,
    # so the first and last rows don't raise a shape mismatch
    lo, hi = max(idx - 1, 0), min(idx + 2, len(accu))
    accu[lo:hi] += vals[lo - (idx - 1): hi - (idx - 1)]

df2['total'] += accu

It could surely be made faster with NumPy broadcasting and fancy indexing, though probably at the cost of memory efficiency. Just tell me if this solution is not fast enough for you as is.
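For what it's worth, the broadcasting/fancy-indexing variant hinted at above could look like this sketch (my own naming; `np.bincount` plays the role of the scatter-add, and the frames are built as in the snippet above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[15, 35, 20], [40, 35, 15]],
                   columns=[1, 2, 3], index=['b', 'd'])
df2 = pd.DataFrame({'category': list('abcdef'),
                    'total': [10] * 6}).set_index('category')

pos = df2.index.get_indexer(df1.index)               # df2 position of each df1 row
targets = (pos[:, None] + np.arange(-1, 2)).ravel()  # above/same/below per row
weights = df1.values.ravel().astype(float)           # row-major, matching targets
mask = (targets >= 0) & (targets < len(df2))         # clip at the edges
accu = np.bincount(targets[mask], weights=weights[mask], minlength=len(df2))
df2['total'] += accu.astype(df2['total'].dtype)
```

This builds all window positions at once and lets bincount handle the overlapping indices, at the cost of materializing the targets and weights arrays.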

Upvotes: 1

akilat90

Reputation: 5706

Somewhat long but doesn't require loops:

import numpy as np
import pandas as pd

first = pd.DataFrame([['b', 15, 35, 20], ['d', 40, 35, 15]],
                     columns=['Category', '1', '2', '3'])

#   Category   1   2   3
# 0        b  15  35  20
# 1        d  40  35  15

second = pd.DataFrame([['a', 10], ['b', 10], ['c', 10],
                       ['d', 10], ['e', 10], ['f', 10]],
                      columns=['Category', 'total'])

#   Category  total
# 0        a     10
# 1        b     10
# 2        c     10
# 3        d     10
# 4        e     10
# 5        f     10

There's a mapping for the above and below values. If you think about it, the neighbour of each entry in second['Category'] can be obtained by shifting the column (note that second['Category'] should have distinct values):

second['above_mapped'] = second['Category'].shift(-1)
second['below_mapped'] = second['Category'].shift(1)

print(second)

  Category  total above_mapped below_mapped
0        a     10            b          NaN
1        b     10            c            a
2        c     10            d            b
3        d     10            e            c
4        e     10            f            d
5        f     10          NaN            e

Using the above mapping table, I can create two dictionaries to define the mapping:

above_map = pd.Series(second['above_mapped'].values, index=second['Category']).to_dict()
# {'a': 'b', 'b': 'c', 'c': 'd', 'd': 'e', 'e': 'f', 'f': nan}


below_map = pd.Series(second['below_mapped'].values, index=second['Category']).to_dict()
# {'a': nan, 'b': 'a', 'c': 'b', 'd': 'c', 'e': 'd', 'f': 'e'}

I can create a mapping from each category to the actual values to add, using just the first dataframe:

above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
# {'b': 15, 'd': 40}

# middle values just correspond to the 'Category' column of first
middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
# {'b': 35, 'd': 35}

below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()
# {'b': 20, 'd': 15}

Now, map the numeric values onto second using the three dictionaries above:

second['above_value_mapped'] = second['above_mapped'].map(above_values)
second['middle_value_mapped']= second['Category'].map(middle_values)
second['below_value_mapped'] = second['below_mapped'].map(below_values)
print(second) 


  Category  total above_mapped below_mapped  above_value_mapped  \
0        a     10            b          NaN                15.0   
1        b     10            c            a                 NaN   
2        c     10            d            b                40.0   
3        d     10            e            c                 NaN   
4        e     10            f            d                 NaN   
5        f     10          NaN            e                 NaN   

   middle_value_mapped  below_value_mapped  
0                  NaN                 NaN  
1                 35.0                 NaN  
2                  NaN                20.0  
3                 35.0                 NaN  
4                  NaN                15.0  
5                  NaN                 NaN 

Summing the second['total'], second['above_value_mapped'], second['middle_value_mapped'], and second['below_value_mapped'] columns gives the desired total:

second.fillna(0, inplace=True)
second['new_total'] = (second['total'] + second['above_value_mapped']
                       + second['middle_value_mapped'] + second['below_value_mapped'])
second[['Category', 'new_total']]

  Category  new_total
0        a       25.0
1        b       45.0
2        c       70.0
3        d       45.0
4        e       25.0
5        f       10.0
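The intermediate columns above are only there for illustration; the same mapping can be condensed into a single expression (a sketch assuming the same first/second frames as above):

```python
import pandas as pd

first = pd.DataFrame([['b', 15, 35, 20], ['d', 40, 35, 15]],
                     columns=['Category', '1', '2', '3'])
second = pd.DataFrame({'Category': list('abcdef'), 'total': [10] * 6})

# category -> value dictionaries, one per window position
above_values = dict(zip(first['Category'], first['1']))
middle_values = dict(zip(first['Category'], first['2']))
below_values = dict(zip(first['Category'], first['3']))

second['new_total'] = (second['total']
                       + second['Category'].shift(-1).map(above_values).fillna(0)
                       + second['Category'].map(middle_values).fillna(0)
                       + second['Category'].shift(1).map(below_values).fillna(0))
```

This avoids creating the helper columns, which also matters memory-wise on a large second frame.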

Timing

# Test dataframes
# NB: pd.util.testing was removed in pandas 2.0; on recent versions,
# generate the random category strings yourself.
a = np.unique(pd.util.testing.rands_array(4, 10000))  # ~10000 distinct length-4 strings as categories
# a.shape ~ 10000

df1 = pd.DataFrame(data=np.random.randint(1, 50, (a.shape[0], 3)),
                   columns=['1', '2', '3'],
                   index=a)
df2 = pd.DataFrame({'Category': a,
                    'total': [10] * a.shape[0]})


def akilat90(first, second):
    second['above_mapped'] = second['Category'].shift(-1)
    second['below_mapped'] = second['Category'].shift(1)

    above_map = pd.Series(second['Category'].shift(-1).values, index=second['Category']).to_dict()
    below_map = pd.Series(second['Category'].shift(1).values, index=second['Category']).to_dict()

    above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
    middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
    below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()

    second['above_value_mapped'] = second['above_mapped'].map(above_values)
    second['middle_value_mapped'] = second['Category'].map(middle_values)
    second['below_value_mapped'] = second['below_mapped'].map(below_values)

    second.fillna(0, inplace=True)
    second['new_total'] = (second['total'] + second['above_value_mapped']
                           + second['middle_value_mapped'] + second['below_value_mapped'])
    return second[['Category', 'new_total']]


def Jacquot(df1, df2):
    df2 = df2.set_index('Category')
    accu = np.zeros_like(df2['total'].values)

    for cat in df1.index.unique():
        # df1.loc[[cat]] stays 2-D even when the category appears only once
        vals = df1.loc[[cat]].values.sum(axis=0)
        for idx in np.where(df2.index == cat)[0]:
            # clip the window at both ends so the first/last rows don't raise
            lo, hi = max(idx - 1, 0), min(idx + 2, len(accu))
            accu[lo:hi] += vals[lo - (idx - 1): hi - (idx - 1)]

    df2['total'] += accu
    return df2



%%timeit
akilat90(df1.reset_index().rename(columns = {'index':'Category'}), df2)
# 33.1 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
Jacquot(df1, df2)
# <Please check>

Upvotes: 1
