Reputation: 91
I have two dataframes (excel sheets) and I want to select only certain columns from a resulting pd.merge
df1 = pd.read_excel('Receivables.xlsx')
#the sheet looks like the table below
ID | Dollar Type | Q1 | Q2 | Q3 | Q4 |
---|---|---|---|---|---|
1 | Receivables | 10 | 20 | 30 | 44 |
2 | Receivables | 11 | 21 | 41 | 54 |
df2 = pd.read_excel('Payables.xlsx')
#the sheet looks like the table below
Payable_ID | Dollar Type | Q1 | Q2 | Q3 | Q4 |
---|---|---|---|---|---|
1 | Payables | 10 | -20 | -30 | -44 |
2 | Payables | 11 | -21 | -41 | -54 |
My merge then looks like this
leftJoin = df1.merge(df2, left_on=['ID'], right_on=['Payable_ID'], how='left')
How do I only select columns Dollar Type
, Q1
, Q2
, Q3
, Q4
from df1
and select Payable_ID
, Q1
, Q2
from df2. The problem I'm running into is that the duplicate columns are appended with either _x or _y. While I get why it's doing that, is there a way to turn off this appending?
In SQL, there's no need for me to specify, for example, df1.[Dollar Type_x] or df2.[Q1_y]
SELECT DF1.[DOLLAR TYPE],
DF1.[Q1],
DF1.[Q2],
DF1.[Q3],
DF1.[Q4],
DF2.[PAYABLE_ID],
DF2.[Q1],
DF2.[Q2]
FROM DF1 LEFT JOIN DF2 ON DF1.ID = DF.PAYABLE_ID
Upvotes: 1
Views: 854
Reputation: 26
You can pass only selected columns of the DataFrames to merge:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left')
Now, you have undesired 'ID' column in the merged DataFrame which you can drop using
leftJoin.drop(columns=['ID'], inplace=True)
You can also make the drop inline with the merge:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left').drop(columns=['ID'], inplace=True)
To change the names of the columns, you can use rename:
leftJoin.rename(columns={'Q1_x':'Q1', 'Q2_x':'Q2','Q1_y':'Q5','Q2_y':'Q6'}, inplace=True)
Which again can be done in the same line:
leftJoin = df1[['ID','Dollar Type','Q1','Q2','Q3','Q4']].merge(df2[['Payable_ID','Q1','Q2']], left_on=['ID'], right_on=['Payable_ID'], how='left').drop(columns=['ID'], inplace=True).rename(columns={'Q1_x':'Q1', 'Q2_x':'Q2','Q1_y':'Q5','Q2_y':'Q6'}, inplace=True)
Upvotes: 1