Reputation: 35
I am trying to make a large dataframe using python. I have a large amount of little dataframes with different row and column names, but there is some overlap between the row names and column names. What I was trying to do is start with one of the little dataframes and then one by one add the others.
Each of the specific row-column combinations is unique and in the end there will probably be a lot of NA.
I have tried doing this with merge from pandas, but this results in a much larger dataframe than I need with row and column names being duplicated instead of merged. If I could find a way that pandas realises that NaN is not a value and overwrites it when a new little dataframe is added, I think I would obtain the result I want. I am also willing to try something that is not using pandas.
For example:
DF1 A B
Y 1 2
Z 0 1
DF2 C D
X 1 2
Z 0 1
Merged: A B C D
Y 1 2 NA NA
Z 0 1 0 1
X NA NA 1 2
And then a new dataframe has to be added:
DF3 C E
Y 0 1
W 1 1
The result should be:
A B C D E
Y 1 2 0 NA 1
Z 0 1 0 1 NA
X NA NA 1 2 NA
W NA NA 1 NA 1
But what happens is:
A B C_x C_y D E
Y 1 2 NA 1 NA 1
Z 0 1 0 0 1 NA
X NA NA 1 1 2 NA
W NA NA 1 1 NA 1
Upvotes: 3
Views: 2617
Reputation: 59549
You want to use DataFrame.combine_first
, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.
df1.combine_first(df2).combine_first(df3)
import pandas as pd
df1 = pd.DataFrame({'A': [1,0], 'B': [2,1]})
df1.index=['Y', 'Z']
df2 = pd.DataFrame({'C': [1,0], 'D': [2,1]})
df2.index=['X', 'Z']
df3 = pd.DataFrame({'C': [0,1], 'E': [1,1]})
df3.index=['Y', 'W']
df1.combine_first(df2).combine_first(df3)
A B C D E
W NaN NaN 1.0 NaN 1.0
X NaN NaN 1.0 2.0 NaN
Y 1.0 2.0 0.0 NaN 1.0
Z 0.0 1.0 0.0 1.0 NaN
Upvotes: 1