Reputation: 388
I need to combine two large dataframes, which now takes hours. I wonder whether there is a faster way to do this. Below is the example: df1 contains some info about shirts, and df2 contains info about pants. I want to merge them into a new dataframe that shows all valid combinations. What is considered valid is that 'Size' and 'Gender' of shifts and pants should be the same while their color can defer. And here is a simplified example (The real scenario is that both df1 and df2 can be 10k lines)
import pandas as pd
import itertools
list_color = ['black','white']
list_size = ['S','M','L']
list_gender = ['M','F']
list_shirts_price = [11,12,13,14,15,16,17,18]
lists_shirts = [list_color, list_size, list_gender]
list_pants_price = [21,22,23,24,25,26,27,28]
lists_pants = [list_color, list_size, list_gender]
df_shirts = pd.DataFrame(list(itertools.product(*lists_shirts)), columns=['Color', 'Size', 'Gender','Price'])
df_shirts['Price'] = list_shirts_price
df_pants = pd.DataFrame(list(itertools.product(*lists_pants)), columns=['Color', 'Size', 'Gender','Price'])
df_pants['Price'] = list_pants_price
df_shirts:
Color | Size | Gender | Price | |
---|---|---|---|---|
0 | black | S | M | 11 |
1 | black | S | F | 12 |
2 | black | L | M | 13 |
3 | black | L | F | 14 |
4 | white | S | M | 15 |
5 | white | S | F | 16 |
6 | white | L | M | 17 |
7 | white | L | F | 18 |
df_pants:
Color | Size | Gender | Price | |
---|---|---|---|---|
0 | black | S | M | 21 |
1 | black | S | F | 22 |
2 | black | L | M | 23 |
3 | black | L | F | 24 |
4 | white | S | M | 25 |
5 | white | S | F | 26 |
6 | white | L | M | 27 |
7 | white | L | F | 28 |
Then, rename make a copy of df_shirts and double the number of rows. And then rename the column names.
df_combined = df_shirts.copy(deep=True)
df_combined=pd.concat([df_combined]*2, ignore_index=True)
df_combined=df_combined.rename(columns={"Color": "Color_shirts", "Price": "Price_shirts"})
Then, add two columns called "Price_pants" and "Color_pants":
list_new_column_from_pants = ['Price_pants']
for key in list_new_column_from_pants:
df_combined[key]=''
df_combined['Color_pants']=''
df_combined:
Index | Color_shirts | Size | Gender | Price_shirts | Price_pants | Color_pants |
---|---|---|---|---|---|---|
0 | black | S | M | 11 | ||
1 | black | S | F | 12 | ||
2 | black | L | M | 13 | ||
3 | black | L | F | 14 | ||
4 | white | S | M | 15 | ||
5 | white | S | F | 16 | ||
6 | white | L | M | 17 | ||
7 | white | L | F | 18 | ||
8 | black | S | M | 11 | ||
9 | black | S | F | 12 | ||
10 | black | L | M | 13 | ||
11 | black | L | F | 14 | ||
12 | white | S | M | 15 | ||
13 | white | S | F | 16 | ||
14 | white | L | M | 17 | ||
15 | white | L | F | 18 |
Then below is the loop to set the empty values:
for key_size, group_size in df_combined.groupby(['Size']):
for key_gender, group_gender in group_size.groupby(['Gender']):
for key_color_shirts, group_color_shirts in group_gender.groupby(['Color_shirts']):
i=0;
for index, row in group_color_shirts.iterrows():
df_combined.loc[index,'Color_pants']=list_color[i]
for key in ["Price_pants"]:# in reality, this there are multiple loops for 'key'
key_pants = key.replace('_pants','')
df_combined.loc[index,key]=df_pants[key_pants][(df_pants['Size']==key_size)
& (df_pants['Gender']==key_gender)
& (df_pants['Color']==list_color[i])].tolist()[0]
i=i+1
And below is the desired output.
Index | Color_shirts | Size | Gender | Price_shirts | Price_pants | Color_pants |
---|---|---|---|---|---|---|
0 | black | S | M | 11 | 21 | black |
1 | black | S | F | 12 | 22 | black |
2 | black | L | M | 13 | 23 | black |
3 | black | L | F | 14 | 24 | black |
4 | white | S | M | 15 | 21 | black |
5 | white | S | F | 16 | 22 | black |
6 | white | L | M | 17 | 23 | black |
7 | white | L | F | 18 | 24 | black |
8 | black | S | M | 11 | 25 | white |
9 | black | S | F | 12 | 26 | white |
10 | black | L | M | 13 | 27 | white |
11 | black | L | F | 14 | 28 | white |
12 | white | S | M | 15 | 25 | white |
13 | white | S | F | 16 | 26 | white |
14 | white | L | M | 17 | 27 | white |
15 | white | L | F | 18 | 28 | white |
In reality, both df_shirt and df_pants are 10k lines. And there are also more columns to set. So the execution time is on the ordre of a few hours.
Upvotes: 4
Views: 423
Reputation: 120399
Avoid loop with pandas
.
Input data:
>>> df_shirts
Color Size Gender Price
0 black S M 11
1 black S F 12
2 black L M 13
3 black L F 14
4 white S M 15
5 white S F 16
6 white L M 17
7 white L F 18
>>> df_pants
Color Size Gender Price
0 black S M 21
1 black S F 22
2 black L M 23
3 black L F 24
4 white S M 25
5 white S F 26
6 white L M 27
7 white L F 28
Use pd.merge
:
out = pd.merge(df_shirts, df_pants, on=['Size', 'Gender'],
suffixes=('_shirts', '_pants'))
Output result:
>>> out
Color_shirts Size Gender Price_shirts Color_pants Price_pants
0 black S M 11 black 21
1 black S M 11 white 25
2 white S M 15 black 21
3 white S M 15 white 25
4 black S F 12 black 22
5 black S F 12 white 26
6 white S F 16 black 22
7 white S F 16 white 26
8 black L M 13 black 23
9 black L M 13 white 27
10 white L M 17 black 23
11 white L M 17 white 27
12 black L F 14 black 24
13 black L F 14 white 28
14 white L F 18 black 24
15 white L F 18 white 28
You set index to ('Size', 'Gender') for better filtering:
>>> out.set_index(['Size', 'Gender']).sort_index()
Color_shirts Price_shirts Color_pants Price_pants
Size Gender
L F black 14 black 24
F black 14 white 28
F white 18 black 24
F white 18 white 28
M black 13 black 23
M black 13 white 27
M white 17 black 23
M white 17 white 27
S F black 12 black 22
F black 12 white 26
F white 16 black 22
F white 16 white 26
M black 11 black 21
M black 11 white 25
M white 15 black 21
M white 15 white 25
Upvotes: 4