How to improve pandas indexing and setting value speed

Question

I need to combine two large dataframes, which now takes hours. I wonder whether there is a faster way to do this. Below is the example: df1 contains some info about shirts, and df2 contains info about pants. I want to merge them into a new dataframe that shows all valid combinations. What is considered valid is that 'Size' and 'Gender' of shifts and pants should be the same while their color can defer. And here is a simplified example (The real scenario is that both df1 and df2 can be 10k lines)

    import pandas as pd
    import itertools
    list_color = ['black','white']
    list_size =  ['S','M','L']
    list_gender = ['M','F']
    list_shirts_price = [11,12,13,14,15,16,17,18]
    lists_shirts = [list_color, list_size, list_gender]
    list_pants_price = [21,22,23,24,25,26,27,28]
    lists_pants = [list_color, list_size, list_gender]
    df_shirts = pd.DataFrame(list(itertools.product(*lists_shirts)), columns=['Color', 'Size', 'Gender','Price'])
    df_shirts['Price'] = list_shirts_price
    df_pants = pd.DataFrame(list(itertools.product(*lists_pants)), columns=['Color', 'Size', 'Gender','Price'])
    df_pants['Price'] = list_pants_price

df_shirts:

	Color	Size	Gender	Price
0	black	S	M	11
1	black	S	F	12
2	black	L	M	13
3	black	L	F	14
4	white	S	M	15
5	white	S	F	16
6	white	L	M	17
7	white	L	F	18

df_pants:

	Color	Size	Gender	Price
0	black	S	M	21
1	black	S	F	22
2	black	L	M	23
3	black	L	F	24
4	white	S	M	25
5	white	S	F	26
6	white	L	M	27
7	white	L	F	28

Then, rename make a copy of df_shirts and double the number of rows. And then rename the column names.

    df_combined = df_shirts.copy(deep=True)
    df_combined=pd.concat([df_combined]*2, ignore_index=True)
    df_combined=df_combined.rename(columns={"Color": "Color_shirts", "Price": "Price_shirts"})
Then, add two columns called "Price_pants" and "Color_pants":

    list_new_column_from_pants = ['Price_pants']
    for key in list_new_column_from_pants:
            df_combined[key]=''
    df_combined['Color_pants']=''

df_combined:

Index	Color_shirts	Size	Gender	Price_shirts
0	black	S	M	11
1	black	S	F	12
2	black	L	M	13
3	black	L	F	14
4	white	S	M	15
5	white	S	F	16
6	white	L	M	17
7	white	L	F	18
8	black	S	M	11
9	black	S	F	12
10	black	L	M	13
11	black	L	F	14
12	white	S	M	15
13	white	S	F	16
14	white	L	M	17
15	white	L	F	18

Then below is the loop to set the empty values:

    for key_size, group_size in df_combined.groupby(['Size']):
        for key_gender, group_gender in group_size.groupby(['Gender']):
            for key_color_shirts, group_color_shirts in group_gender.groupby(['Color_shirts']):
                i=0;
                for index, row in group_color_shirts.iterrows():
                    df_combined.loc[index,'Color_pants']=list_color[i]
                    for key in ["Price_pants"]:# in reality, this there are multiple loops for 'key'
                        key_pants = key.replace('_pants','')
                        df_combined.loc[index,key]=df_pants[key_pants][(df_pants['Size']==key_size)
                                                            & (df_pants['Gender']==key_gender)
                                                            & (df_pants['Color']==list_color[i])].tolist()[0]
                    i=i+1

And below is the desired output.

Index	Color_shirts	Size	Gender	Price_shirts	Price_pants	Color_pants
0	black	S	M	11	21	black
1	black	S	F	12	22	black
2	black	L	M	13	23	black
3	black	L	F	14	24	black
4	white	S	M	15	21	black
5	white	S	F	16	22	black
6	white	L	M	17	23	black
7	white	L	F	18	24	black
8	black	S	M	11	25	white
9	black	S	F	12	26	white
10	black	L	M	13	27	white
11	black	L	F	14	28	white
12	white	S	M	15	25	white
13	white	S	F	16	26	white
14	white	L	M	17	27	white
15	white	L	F	18	28	white

In reality, both df_shirt and df_pants are 10k lines. And there are also more columns to set. So the execution time is on the ordre of a few hours.

Corralien · Accepted Answer

Avoid loop with pandas.

Input data:

>>> df_shirts
   Color Size Gender  Price
0  black    S      M     11
1  black    S      F     12
2  black    L      M     13
3  black    L      F     14
4  white    S      M     15
5  white    S      F     16
6  white    L      M     17
7  white    L      F     18

>>> df_pants
   Color Size Gender  Price
0  black    S      M     21
1  black    S      F     22
2  black    L      M     23
3  black    L      F     24
4  white    S      M     25
5  white    S      F     26
6  white    L      M     27
7  white    L      F     28

Use pd.merge:

out = pd.merge(df_shirts, df_pants, on=['Size', 'Gender'],
               suffixes=('_shirts', '_pants'))

Output result:

>>> out
   Color_shirts Size Gender  Price_shirts Color_pants  Price_pants
0         black    S      M            11       black           21
1         black    S      M            11       white           25
2         white    S      M            15       black           21
3         white    S      M            15       white           25
4         black    S      F            12       black           22
5         black    S      F            12       white           26
6         white    S      F            16       black           22
7         white    S      F            16       white           26
8         black    L      M            13       black           23
9         black    L      M            13       white           27
10        white    L      M            17       black           23
11        white    L      M            17       white           27
12        black    L      F            14       black           24
13        black    L      F            14       white           28
14        white    L      F            18       black           24
15        white    L      F            18       white           28

You set index to ('Size', 'Gender') for better filtering:

>>> out.set_index(['Size', 'Gender']).sort_index()
            Color_shirts  Price_shirts Color_pants  Price_pants
Size Gender
L    F             black            14       black           24
     F             black            14       white           28
     F             white            18       black           24
     F             white            18       white           28
     M             black            13       black           23
     M             black            13       white           27
     M             white            17       black           23
     M             white            17       white           27
S    F             black            12       black           22
     F             black            12       white           26
     F             white            16       black           22
     F             white            16       white           26
     M             black            11       black           21
     M             black            11       white           25
     M             white            15       black           21
     M             white            15       white           25

How to improve pandas indexing and setting value speed

Answers (1)

Related Questions