Reputation: 23
I am looking to merge 2 dataframes in python, df1 and df2, on 2 columns, Site and Building, with different row quantities as a way of attaining a "safe" value for each generator value in df1. Below is a demonstration code, though I've created the dataframes in the below example (which appears to work), the data for each table in the actual issue comes from SQL queries, which leads me to believe the merge is having issues due to data type.
import pandas as pd
df = {'Site': ['Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Greece','Greece','Greece','Greece','Greece','Greece'],
'Building' : ['X1','X1','X1','X1','X1','X1','X2','X2','X2','X2','X2','X2','X3','X3','X3','X3','X3','X3','X4','X4','X4','X4','X4', 'X4','X5','X5','X5','X5','X5','X5','X1','X1', 'X1','X1', 'X1','X1','X2','X2','X2','X2','X2','X2','X3','X3','X3','X3','X3','X3','X1', 'X1','X1', 'X1','X1', 'X1'],
'Generator' : ['DE','NDE', 'GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE', 'GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4', 'DE','NDE','GBX1','GBX2','GBX3','GBX4','DE', 'NDE','GBX1','GBX2','GBX3','GBX4', 'DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4']}
df1 = pd.DataFrame(df1, columns = ['Site', 'Building', 'Generator'])
df15 = {'Building' : ['X1','X2','X3','X4','X5','X1','X2','X3','X1'],
'Site': ['Belgium','Belgium','Belgium','Belgium','Belgium','Holland','Holland','Holland','Greece'],
'Safe' : [1, 1, 1, 1, 1, 0, 1, 1, 0]}
df2 = pd.DataFrame(df15, columns = ['Site', 'Building', 'Safe'])
df3 = df1.merge(df2, how = 'left', on = ['Site', 'Building'], indicator = True)
I've also tried changing the data types of each to string as per pandas - Merging on string columns not working (bug?),
df1['Site'] = df1['Site'].astype('str')
df1['Building']=df1['Building'].astype('str')
df1['Site'] = df1['Site'].astype('str')
df1['Building']=df1['Building'].astype('str')
and also the step that mentions checking encoding as per below, but all seem to match ie. there are no visible dependencies in content;
df1['Building'] = df1['Building'].str.encode('UTF-8')
df1['Site'] = df1['Site'].str.encode('UTF-8')
Datatypes:
df2.datatypes:
Site object
Building object
Safe object
dtype: object
df1.datatypes:
Building object
Site object
Generator object
dtype: object
I've tried the below code:
df3 = df1.merge(df2, left_on = ['Site', 'Building'], right_on = ['Site', 'Building'], how = 'left', indicator = 'indicator')
or:
df3 = df1.merge(df2, on = ['Site', 'Building'], how = 'left', indicator = 'indicator')
but the outcome ends up with only the data from left ie outcome 1.
I've tried outer join as per below, which yields outcome 2:
df3 = df1.merge(df2, on = ['Site', 'Building'], how = 'outer', indicator = 'indicator')
Apologies for my relative ignorance with respect to pandas.
Upvotes: 0
Views: 70
Reputation: 687
I noticed a small error in the code you shared.
df1 = pd.DataFrame(df1, columns = ['Site', 'Building', 'Generator'])
should be df1 = pd.DataFrame(df, columns = ['Site', 'Building', 'Generator'])
. The variable that you should be passing to pd.Dataframe should be df
and not df1
.
After this step, simply doing a merge using the dataframes gives the desired outcome
pd.merge(df1,df2, on=['Building','Site'])
Output looks like :
Site Building Generator Safe
0 Belgium X1 DE 1
1 Belgium X1 NDE 1
2 Belgium X1 GBX1 1
3 Belgium X1 GBX2 1
4 Belgium X1 GBX3 1
5 Belgium X1 GBX4 1
6 Belgium X2 DE 1
7 Belgium X2 NDE 1
8 Belgium X2 GBX1 1
9 Belgium X2 GBX2 1
10 Belgium X2 GBX3 1
11 Belgium X2 GBX4 1
12 Belgium X3 DE 1
Upvotes: 0