user47
user47

Reputation: 415

Search Optimization

I am working on a problem statement where there are two dataframes df1 and df_main.

df_main is as follows :

users = ['id1','id1','id2','id2','id3','id3','id4']
keywords = ['k1','k1', 'k2','k2','k2','k3','k3']
quantity = [10,10,2,2,2,4,4]
duration  = [1,1,3,3,3,2,2]

df_main = pd.DataFrame(list(zip(users, keywords, quantity, duration)), columns = ['users','keywords','quantity','duration'])

df_main is basically a dataframe with user_id information, their corresponding keywords and the qty and duration columns

df1 has one column for user_id and the remaining columns for all keywords in df_main. Using the main_df as reference, Each user_id and keyword pair is flagged as 1 otherwise it's kept as 0.

Here's the code for df1 :

columns = ['USER_ID','k1','k2','k3']
users = ['id1','id2','id3','id4']
values1 = [1,0,0,0]
values2 = [0,1,1,0]
values3 = [0,0,1,1]




df1 = pd.DataFrame(list(zip(users, values1, values2, values3)), columns = columns)

Now as a result I want the following dataframe :

enter image description here

Where total_quantity and total_duration is the sum of the quantity and duration values for each id and keyword pair.

I have tried the following code :

    all_keywords = df1.columns.tolist()[1:]
    data_list = []
    for keyword in all_keywords:
        ID_selected = df1[df1[keyword] == 1]['ID'].values.tolist()
        indexes = df1[df1[keyword] == 1].index.tolist()

        qty_list = [0] * len(df1)
        duration_list = [0] * len(df1)
        all_qty = []
        all_duration = []
        for id in ID_selected:
            all_qty.append(np.sum(df_main[(df_main['users'] == id) & (df_main['keywords'] == keyword)]['quantity'].values.tolist()))
            all_duration.append(np.sum(df_main[(df_main['users'] == umi) & (df_main['keyword'] == meds)]['DaysSupply'].values.tolist()))

        for index, qty, duration in zip(indexes, all_qty, all_duration):
            qty_list[index] = qty
            duration_list[index] = duration

        d_temp = pd.DataFrame(list(zip(qty_list, duration_list)), columns = [keyword+'qty', keyword+'duration'])

        data_list.append(d_temp)

    result = pd.concat(data_list)

The code is working, but it's really slow and I really want to get rid of the loops. I would appreciate if someone could show me a more optimized way of doing this.

Upvotes: 1

Views: 55

Answers (1)

RichieV
RichieV

Reputation: 5183

The main problem with performance in your code is the multiple loops. You can delegate all that looping to C implementations of numpy by using pandas built-in methods.

For example, use df_main.groupby with sum, reshape, and flatten index... then merge with df1.

df_temp = df_main.groupby(['users', 'keywords']).sum().unstack()
df_temp.columns = 'total_' + df_temp.columns.map('_'.join) # flatten column index

df1 = df1.merge(df_temp, left_on='USER_ID', right_on='users')

Output

       total_quantity_k1  total_quantity_k2  total_quantity_k3  \
users
id1                 20.0                NaN                NaN
id2                  NaN                4.0                NaN
id3                  NaN                2.0                4.0
id4                  NaN                NaN                4.0

       total_duration_k1  total_duration_k2  total_duration_k3
users
id1                  2.0                NaN                NaN
id2                  NaN                6.0                NaN
id3                  NaN                3.0                2.0
id4                  NaN                NaN                2.0
  USER_ID  k1  k2  k3  total_quantity_k1  total_quantity_k2  \
0     id1   1   0   0               20.0                NaN
1     id2   0   1   0                NaN                4.0
2     id3   0   1   1                NaN                2.0
3     id4   0   0   1                NaN                NaN

   total_quantity_k3  total_duration_k1  total_duration_k2  total_duration_k3
0                NaN                2.0                NaN                NaN
1                NaN                NaN                6.0                NaN
2                4.0                NaN                3.0                2.0
3                4.0                NaN                NaN                2.0

Upvotes: 1

Related Questions