spacedinosaur10
spacedinosaur10

Reputation: 725

Combine Datasets Pandas On Index Of Df1

I have one dataset (df1) where I would like to fill in data from my second dataset (df2).There is only one column in both dataframes which overlap and I have set that column to the index of both df1 and df2 so I can therefore merge on the index.

df = pd.read_excel('Data.xlsx', sheetname= 'Dataset1')
df2 = pd.read_excel('Data.xlsx', sheetname= 'Dataset2')
df1.set_index("ORG_ID", inplace=True)
df2.set_index("ORG_ID", inplace=True)
df3 = df1.merge(df2.ix[:,df2.columns-df1.columns], left_index=True, right_index=True, how="outer")

What I would like the output to be is a a new dataset (df3) which lists all the data from df1, including the index (ORG_IDs) , and includes all the new columns from df2 with filled in data based on the ORG_IDs listed in df1. What python seems to be doing here is giving me a new dataframe (df3), filling in the data for df1, and then adding all the Org_ids from the second dataset (df2) below the ORG_IDs from df1, which is not what I want.

I have also tried using combine_first but it seemed to yield a similar result.

df3= df1.combine_first(df2)


Dataset1 (df1)
ORG_ID  COUNTRY TOWN    STORE   PRODUCT PRICE
1   Spain   Madrid  Pink    Garment 100
2   Greece  Chania  White   Toy 200
3   U.K Manchester  Red Garment 300
4   Italy   Rome    Red Accessory   500
5   Spain   Marbella    Blue    Accessory   20
6   Greece  Chania  Green   Garment 25
7   U.K Manchester  Pink    Toy 36
8   Italy   Siena   Red Accessory   150
9   Spain   Barcelona   White   Toy 200
10  Greece  Corfu   Blue    Accessory   500

Dataset2 (df2)

ORG_ID  CUSTOMER    TYPE    PARENT  REGION
5   A   Pop Rose    Europe
10  A   Cry Tulip   Europe
24  C   Fig Lily    Europe
89  G   Pop Rose    Europe
6   R   Fig Lily    Europe
4   Y   Pop Rose    Europe
1   T   Fig Tulip   Europe
7   H   Pop Tulip   Europe
8   S   Fig Rose    Europe

Dataset3 (df3) - What I would like

ORG_ID  COUNTRY TOWN    STORE   PRODUCT PRICE   CUSTOMER    TYPE    PARENT  REGION
1   Spain   Madrid  Pink    Garment 100 T   Fig Tulip   Europe
2   Greece  Chania  White   Toy 200 NaN NaN NaN NaN
3   U.K Manchester  Red Garment 300 NaN NaN NaN NaN
4   Italy   Rome    Red Accessory   500 Y   Pop Rose    Europe
5   Spain   Marbella    Blue    Accessory   20  A   Pop Rose    Europe
6   Greece  Chania  Green   Garment 25  R   Fig Lily    Europe
7   U.K Manchester  Pink    Toy 36  H   Pop Tulip   Europe
8   Italy   Siena   Red Accessory   150 S   Fig Rose    Europe
9   Spain   Barcelona   White   Toy 200 NaN NaN NaN NaN
10  Greece  Corfu   Blue    Accessory   500 A   Cry Tulip   Europe

Upvotes: 0

Views: 85

Answers (1)

Scott Boston
Scott Boston

Reputation: 153510

You don't have to set_index in your datafames. You can use merge with the on parameter and how='left'.

df1 = pd.read_excel('Data.xlsx', sheetname= 'Dataset1')
df2 = pd.read_excel('Data.xlsx', sheetname= 'Dataset2')

df3 = df1.merge(df2, how='left', on='ORG_ID')

Output:

   ORG_ID COUNTRY        TOWN  STORE    PRODUCT  PRICE CUSTOMER TYPE PARENT  \
0       1   Spain      Madrid   Pink    Garment    100        T  Fig  Tulip   
1       2  Greece      Chania  White        Toy    200      NaN  NaN    NaN   
2       3     U.K  Manchester    Red    Garment    300      NaN  NaN    NaN   
3       4   Italy        Rome    Red  Accessory    500        Y  Pop   Rose   
4       5   Spain    Marbella   Blue  Accessory     20        A  Pop   Rose   
5       6  Greece      Chania  Green    Garment     25        R  Fig   Lily   
6       7     U.K  Manchester   Pink        Toy     36        H  Pop  Tulip   
7       8   Italy       Siena    Red  Accessory    150        S  Fig   Rose   
8       9   Spain   Barcelona  White        Toy    200      NaN  NaN    NaN   
9      10  Greece       Corfu   Blue  Accessory    500        A  Cry  Tulip   

   REGION  
0  Europe  
1     NaN  
2     NaN  
3  Europe  
4  Europe  
5  Europe  
6  Europe  
7  Europe  
8     NaN  
9  Europe  

Upvotes: 2

Related Questions