Reputation: 725
I have one dataset (df1) where I would like to fill in data from my second dataset (df2).There is only one column in both dataframes which overlap and I have set that column to the index of both df1 and df2 so I can therefore merge on the index.
df = pd.read_excel('Data.xlsx', sheetname= 'Dataset1')
df2 = pd.read_excel('Data.xlsx', sheetname= 'Dataset2')
df1.set_index("ORG_ID", inplace=True)
df2.set_index("ORG_ID", inplace=True)
df3 = df1.merge(df2.ix[:,df2.columns-df1.columns], left_index=True, right_index=True, how="outer")
What I would like the output to be is a a new dataset (df3) which lists all the data from df1, including the index (ORG_IDs) , and includes all the new columns from df2 with filled in data based on the ORG_IDs listed in df1. What python seems to be doing here is giving me a new dataframe (df3), filling in the data for df1, and then adding all the Org_ids from the second dataset (df2) below the ORG_IDs from df1, which is not what I want.
I have also tried using combine_first but it seemed to yield a similar result.
df3= df1.combine_first(df2)
Dataset1 (df1)
ORG_ID COUNTRY TOWN STORE PRODUCT PRICE
1 Spain Madrid Pink Garment 100
2 Greece Chania White Toy 200
3 U.K Manchester Red Garment 300
4 Italy Rome Red Accessory 500
5 Spain Marbella Blue Accessory 20
6 Greece Chania Green Garment 25
7 U.K Manchester Pink Toy 36
8 Italy Siena Red Accessory 150
9 Spain Barcelona White Toy 200
10 Greece Corfu Blue Accessory 500
Dataset2 (df2)
ORG_ID CUSTOMER TYPE PARENT REGION
5 A Pop Rose Europe
10 A Cry Tulip Europe
24 C Fig Lily Europe
89 G Pop Rose Europe
6 R Fig Lily Europe
4 Y Pop Rose Europe
1 T Fig Tulip Europe
7 H Pop Tulip Europe
8 S Fig Rose Europe
Dataset3 (df3) - What I would like
ORG_ID COUNTRY TOWN STORE PRODUCT PRICE CUSTOMER TYPE PARENT REGION
1 Spain Madrid Pink Garment 100 T Fig Tulip Europe
2 Greece Chania White Toy 200 NaN NaN NaN NaN
3 U.K Manchester Red Garment 300 NaN NaN NaN NaN
4 Italy Rome Red Accessory 500 Y Pop Rose Europe
5 Spain Marbella Blue Accessory 20 A Pop Rose Europe
6 Greece Chania Green Garment 25 R Fig Lily Europe
7 U.K Manchester Pink Toy 36 H Pop Tulip Europe
8 Italy Siena Red Accessory 150 S Fig Rose Europe
9 Spain Barcelona White Toy 200 NaN NaN NaN NaN
10 Greece Corfu Blue Accessory 500 A Cry Tulip Europe
Upvotes: 0
Views: 85
Reputation: 153510
You don't have to set_index
in your datafames. You can use merge
with the on
parameter and how='left'
.
df1 = pd.read_excel('Data.xlsx', sheetname= 'Dataset1')
df2 = pd.read_excel('Data.xlsx', sheetname= 'Dataset2')
df3 = df1.merge(df2, how='left', on='ORG_ID')
Output:
ORG_ID COUNTRY TOWN STORE PRODUCT PRICE CUSTOMER TYPE PARENT \
0 1 Spain Madrid Pink Garment 100 T Fig Tulip
1 2 Greece Chania White Toy 200 NaN NaN NaN
2 3 U.K Manchester Red Garment 300 NaN NaN NaN
3 4 Italy Rome Red Accessory 500 Y Pop Rose
4 5 Spain Marbella Blue Accessory 20 A Pop Rose
5 6 Greece Chania Green Garment 25 R Fig Lily
6 7 U.K Manchester Pink Toy 36 H Pop Tulip
7 8 Italy Siena Red Accessory 150 S Fig Rose
8 9 Spain Barcelona White Toy 200 NaN NaN NaN
9 10 Greece Corfu Blue Accessory 500 A Cry Tulip
REGION
0 Europe
1 NaN
2 NaN
3 Europe
4 Europe
5 Europe
6 Europe
7 Europe
8 NaN
9 Europe
Upvotes: 2