Reputation: 651
I have two pandas DataFrames like this:
category   1   2   3
b         15  35  20
d         40  35  15

category  total
a            10
b            10
c            10
d            10
e            10
f            10
In the second DataFrame the categories are unique, so there is only one row per category. In the first DataFrame a category can appear more than once.
I would like to add the element in column '2' of the first DataFrame to the corresponding element in the second DataFrame; the element in column '1' should be added to the cell above, and the one in column '3' to the cell below.
Rendering this result:
category  total
a         10 + 15
b         10 + 35
c         10 + 20 + 40
d         10 + 35
e         10 + 15
f         10
Is there a good way to do this using pandas? I have a very large dataset, so it is important to me that the approach I choose is fast and doesn't use too much memory. Would it be better if I did not use pandas and used NumPy instead?
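For concreteness, the two frames above can be rebuilt with something like the following sketch (the names df1 and df2, the integer column labels, and the use of 'category' as the index are assumptions based on the display):

import pandas as pd

# First frame: categories may repeat; columns 1, 2 and 3 hold the values to spread out
df1 = pd.DataFrame({1: [15, 40], 2: [35, 35], 3: [20, 15]},
                   index=pd.Index(['b', 'd'], name='category'))

# Second frame: exactly one row per category, with a running total
df2 = pd.DataFrame({'total': [10] * 6},
                   index=pd.Index(list('abcdef'), name='category'))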
Upvotes: 3
Views: 113
Reputation: 1790
I found it hard to do this in a completely vectorized way: the thing is that the categories targeted in df2 can sit at contiguous indices, so the three-row windows would overlap (for instance, if both 'b' and 'c' appeared in df1, their windows would both cover rows 'b' and 'c'), making a loop (for me) necessary.
How I created the data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[[15, 35, 20],
                         [40, 35, 15]],
                   columns=[1, 2, 3],
                   index=['b', 'd'])
df2 = pd.DataFrame({'category': list('abcdef'),
                    'total': [10] * 6})
df2 = df2.set_index('category')
And then the processing part: the accu array accumulates all the values that we will later add to the total column.
accu = np.zeros_like(df2['total'].values.ravel())
for cat in df1.index.unique():
    idx = df2.index.get_loc(cat)
    # df1.loc[[cat]] keeps a 2-D frame even when cat occurs only once, so the
    # column-wise sum is always a length-3 vector: (above, same row, below)
    accu[max(idx - 1, 0) : idx + 2] += np.sum(df1.loc[[cat]].values, axis=0)
df2['total'] += accu
It could surely be made faster with NumPy broadcasting and smart-indexing functionality, but at the cost of memory efficiency in my opinion. Just tell me if this solution is not fast enough for you as is.
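For what it's worth, here is a minimal, unbenchmarked sketch of what such a vectorized variant could look like with np.add.at (which accumulates unbuffered, so overlapping windows and repeated categories are handled correctly); it assumes every category of df1 is also present in df2 and masks out contributions that would fall outside the frame:

import numpy as np

accu = np.zeros(len(df2), dtype=df2['total'].dtype)
pos = df2.index.get_indexer(df1.index)   # row position in df2 of each df1 row

# column 2 lands on the category's own row
np.add.at(accu, pos, df1[2].values)
# column 1 lands one row above, column 3 one row below (edges masked out)
above, below = pos - 1, pos + 1
np.add.at(accu, above[above >= 0], df1[1].values[above >= 0])
np.add.at(accu, below[below < len(df2)], df1[3].values[below < len(df2)])

df2['total'] += accu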
Upvotes: 1
Reputation: 5706
Somewhat long but doesn't require loops:
first = pd.DataFrame([['b', 15, 35, 20], ['d', 40, 35, 15]],
                     columns=['Category', '1', '2', '3'])
# Category 1 2 3
# 0 b 15 35 20
# 1 d 40 35 15
second = pd.DataFrame([['a', 10], ['b', 10], ['c', 10], ['d', 10], ['e', 10], ['f', 10]],
                      columns=['Category', 'total'])
# Category total
# 0 a 10
# 1 b 10
# 2 c 10
# 3 d 10
# 4 e 10
# 5 f 10
There's a mapping for the above and below values. If you think about it, the mapping of second['Category'] to first['Category'] can be summarized as below (second['Category'] should have distinct values):
second['above_mapped'] = second['Category'].shift(-1)
second['below_mapped'] = second['Category'].shift(1)
print(second)
  Category  total above_mapped below_mapped
0        a     10            b          NaN
1        b     10            c            a
2        c     10            d            b
3        d     10            e            c
4        e     10            f            d
5        f     10          NaN            e
Using the above mapping table, I can create two dictionaries to define the mapping:
above_map = pd.Series(second['above_mapped'].values, index=second['Category']).to_dict()
# {'a': 'b', 'b': 'c', 'c': 'd', 'd': 'e', 'e': 'f', 'f': nan}
below_map = pd.Series(second['below_mapped'].values, index=second['Category']).to_dict()
# {'a': nan, 'b': 'a', 'c': 'b', 'd': 'c', 'e': 'd', 'f': 'e'}
I can create the mappings to the actual values to be added by just using the first dataframe:
above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
# {'b': 15, 'd': 40}
# middle values just correspond to the 'Category' column of first
middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
# {'b': 35, 'd': 35}
below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()
# {'b': 20, 'd': 15}
Now, mapping the numeric values onto second using the three dictionaries above:
second['above_value_mapped'] = second['above_mapped'].map(above_values)
second['middle_value_mapped'] = second['Category'].map(middle_values)
second['below_value_mapped'] = second['below_mapped'].map(below_values)
print(second)
  Category  total above_mapped below_mapped  above_value_mapped  \
0        a     10            b          NaN                15.0
1        b     10            c            a                 NaN
2        c     10            d            b                40.0
3        d     10            e            c                 NaN
4        e     10            f            d                 NaN
5        f     10          NaN            e                 NaN

   middle_value_mapped  below_value_mapped
0                  NaN                 NaN
1                 35.0                 NaN
2                  NaN                20.0
3                 35.0                 NaN
4                  NaN                15.0
5                  NaN                 NaN
Summing up the second['total'], second['above_value_mapped'], second['middle_value_mapped'], and second['below_value_mapped'] columns gives the desired total value:
second.fillna(0, inplace=True)
second['new_total'] = (second['total'] + second['above_value_mapped']
                       + second['middle_value_mapped'] + second['below_value_mapped'])
second[['Category', 'new_total']]
  Category  new_total
0        a       25.0
1        b       45.0
2        c       70.0
3        d       45.0
4        e       25.0
5        f       10.0
# Test dataframes
a = np.unique(pd.util.testing.rands_array(4, 10000))  # length-4 distinct strings as categories
# a.shape ~ 10000
df1 = pd.DataFrame(data=np.random.randint(1, 50, (a.shape[0], 3)),
                   columns=['1', '2', '3'],
                   index=a)
df2 = pd.DataFrame({'Category': a,
                    'total': [10] * a.shape[0]})
def akilat90(first, second):
    second['above_mapped'] = second['Category'].shift(-1)
    second['below_mapped'] = second['Category'].shift(1)
    above_map = pd.Series(second['Category'].shift(-1).values, index=second['Category']).to_dict()
    below_map = pd.Series(second['Category'].shift(1).values, index=second['Category']).to_dict()
    above_values = pd.Series(first['1'].values, index=first['Category']).to_dict()
    middle_values = pd.Series(first['2'].values, index=first['Category']).to_dict()
    below_values = pd.Series(first['3'].values, index=first['Category']).to_dict()
    second['above_value_mapped'] = second['above_mapped'].map(above_values)
    second['middle_value_mapped'] = second['Category'].map(middle_values)
    second['below_value_mapped'] = second['below_mapped'].map(below_values)
    second.fillna(0, inplace=True)
    second['new_total'] = (second['total'] + second['above_value_mapped']
                           + second['middle_value_mapped'] + second['below_value_mapped'])
    return second[['Category', 'new_total']]
def Jacquot(df1, df2):
    df2 = df2.set_index('Category')
    accu = np.zeros_like(df2['total'].values.ravel())
    for cat in df1.index.unique():
        idxs = np.where(df2.index == cat)[0]
        for idx in idxs:
            accu[max(idx - 1, 0) : (idx + 2)] += df1.loc[cat].values
    df2['total'] += accu
    return df2
%%timeit
akilat90(df1.reset_index().rename(columns = {'index':'Category'}), df2)
# 33.1 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
Jacquot(df1, df2)
# <Please check>
Upvotes: 1