Reputation: 157
I want to match two pandas Dataframes by the name of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
A B C
0 0 2 1
1 1 3 0
2 0 4 0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
A B D
0 0 0 1
1 1 5 0
2 0 7 0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
Upvotes: 1
Views: 497
Reputation: 35626
It appears that df2
columns should be the same as df1
columns after this operation, as columns that are in df1
and not df2
should be added, while columns only in df2
should be removed. We can simply reindex
df2
to match df1
columns with a fill_value=0
(this is the safe equivalent to df2 = df2[df1.columns]
when adding new columns with a fill value):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2
:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
Upvotes: 2