Reputation: 415
I am working on a problem where there are two dataframes, df1 and df_main.
df_main is as follows:
import pandas as pd
import numpy as np

users = ['id1','id1','id2','id2','id3','id3','id4']
keywords = ['k1','k1', 'k2','k2','k2','k3','k3']
quantity = [10,10,2,2,2,4,4]
duration = [1,1,3,3,3,2,2]
df_main = pd.DataFrame(list(zip(users, keywords, quantity, duration)), columns = ['users','keywords','quantity','duration'])
df_main is basically a dataframe with user_id information, the corresponding keywords, and the quantity and duration columns.
df1 has one column for user_id and the remaining columns for all keywords in df_main. Using df_main as reference, each user_id and keyword pair is flagged as 1; otherwise it is kept as 0.
Here's the code for df1:
columns = ['USER_ID','k1','k2','k3']
users = ['id1','id2','id3','id4']
values1 = [1,0,0,0]
values2 = [0,1,1,0]
values3 = [0,0,1,1]
df1 = pd.DataFrame(list(zip(users, values1, values2, values3)), columns = columns)
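(Side note: if the 0/1 flags are simply meant to mark which user/keyword pairs occur in df_main, df1 could also be built from df_main instead of hard-coding the value lists. A minimal sketch — df1_alt is just an illustrative name:)
# crosstab counts the rows per (user, keyword) pair; clipping at 1 turns the counts into 0/1 flags
df1_alt = pd.crosstab(df_main['users'], df_main['keywords']).clip(upper=1).reset_index()
df1_alt = df1_alt.rename(columns={'users': 'USER_ID'})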
Now, as a result, I want a dataframe where total_quantity and total_duration are the sums of the quantity and duration values for each id and keyword pair (for example, for id1 and k1 that would be total_quantity = 10 + 10 = 20 and total_duration = 1 + 1 = 2).
I have tried the following code:
all_keywords = df1.columns.tolist()[1:]
data_list = []
for keyword in all_keywords:
    # users flagged with 1 for this keyword, and their row positions in df1
    ID_selected = df1[df1[keyword] == 1]['USER_ID'].values.tolist()
    indexes = df1[df1[keyword] == 1].index.tolist()
    qty_list = [0] * len(df1)
    duration_list = [0] * len(df1)
    all_qty = []
    all_duration = []
    for id in ID_selected:
        all_qty.append(np.sum(df_main[(df_main['users'] == id) & (df_main['keywords'] == keyword)]['quantity'].values.tolist()))
        all_duration.append(np.sum(df_main[(df_main['users'] == id) & (df_main['keywords'] == keyword)]['duration'].values.tolist()))
    for index, qty, duration in zip(indexes, all_qty, all_duration):
        qty_list[index] = qty
        duration_list[index] = duration
    d_temp = pd.DataFrame(list(zip(qty_list, duration_list)), columns = [keyword+'qty', keyword+'duration'])
    data_list.append(d_temp)
result = pd.concat(data_list, axis=1)  # one pair of columns per keyword, side by side
The code is working, but it's really slow, and I really want to get rid of the loops. I would appreciate it if someone could show me a more optimized way of doing this.
Upvotes: 1
Views: 55
Reputation: 5183
The main problem with performance in your code is the multiple loops. You can delegate all that looping to the C implementations underneath NumPy and pandas by using pandas' built-in methods.
For example, group df_main with df_main.groupby and sum, reshape with unstack, flatten the column index, and then merge the result with df1:
df_temp = df_main.groupby(['users', 'keywords']).sum().unstack()  # sum quantity/duration per pair, pivot keywords into columns
df_temp.columns = 'total_' + df_temp.columns.map('_'.join) # flatten column index
df1 = df1.merge(df_temp, left_on='USER_ID', right_on='users')
Output (first the intermediate df_temp, then df1 after the merge):
total_quantity_k1 total_quantity_k2 total_quantity_k3 \
users
id1 20.0 NaN NaN
id2 NaN 4.0 NaN
id3 NaN 2.0 4.0
id4 NaN NaN 4.0
total_duration_k1 total_duration_k2 total_duration_k3
users
id1 2.0 NaN NaN
id2 NaN 6.0 NaN
id3 NaN 3.0 2.0
id4 NaN NaN 2.0
USER_ID k1 k2 k3 total_quantity_k1 total_quantity_k2 \
0 id1 1 0 0 20.0 NaN
1 id2 0 1 0 NaN 4.0
2 id3 0 1 1 NaN 2.0
3 id4 0 0 1 NaN NaN
total_quantity_k3 total_duration_k1 total_duration_k2 total_duration_k3
0 NaN 2.0 NaN NaN
1 NaN NaN 6.0 NaN
2 4.0 NaN 3.0 2.0
3 4.0 NaN NaN 2.0
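As a side note, a sketch along the same lines (the name result is mine, not from the question): pivot_table can do the group-sum and the reshape in one step, and fill_value=0 yields zeros instead of the NaN values shown above for missing user/keyword pairs.
# same totals as above, but missing (user, keyword) pairs become 0 instead of NaN
df_temp = df_main.pivot_table(index='users', columns='keywords',
                              values=['quantity', 'duration'],
                              aggfunc='sum', fill_value=0)
df_temp.columns = 'total_' + df_temp.columns.map('_'.join)  # flatten column index as before
result = df1.merge(df_temp, left_on='USER_ID', right_index=True)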
Upvotes: 1