Zoem
Zoem

Reputation: 35

Merge on columns and rows

I am trying to make a large dataframe using python. I have a large amount of little dataframes with different row and column names, but there is some overlap between the row names and column names. What I was trying to do is start with one of the little dataframes and then one by one add the others.

Each of the specific row-column combinations is unique and in the end there will probably be a lot of NA.

I have tried doing this with merge from pandas, but this results in a much larger dataframe than I need with row and column names being duplicated instead of merged. If I could find a way that pandas realises that NaN is not a value and overwrites it when a new little dataframe is added, I think I would obtain the result I want. I am also willing to try something that is not using pandas.

For example:
DF1         A    B
        Y   1    2
        Z   0    1


DF2         C    D
        X   1    2
        Z   0    1

Merged:     A    B    C    D
        Y   1    2    NA   NA
        Z   0    1    0    1
        X  NA    NA   1    2

And then a new dataframe has to be added:
DF3         C    E
        Y   0    1
        W   1    1

The result should be:
            A    B    C    D    E
        Y   1    2    0    NA   1
        Z   0    1    0    1    NA
        X   NA   NA   1    2    NA
        W   NA   NA   1    NA   1

But what happens is:
            A    B    C_x    C_y    D    E
        Y   1    2    NA     1      NA   1
        Z   0    1    0      0      1    NA
        X   NA   NA   1      1      2    NA
        W   NA   NA   1      1      NA   1

Upvotes: 3

Views: 2617

Answers (1)

ALollz
ALollz

Reputation: 59549

You want to use DataFrame.combine_first, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.

df1.combine_first(df2).combine_first(df3)

Sample data

import pandas as pd
df1 = pd.DataFrame({'A': [1,0], 'B': [2,1]})
df1.index=['Y', 'Z']

df2 = pd.DataFrame({'C': [1,0], 'D': [2,1]})
df2.index=['X', 'Z']

df3 = pd.DataFrame({'C': [0,1], 'E': [1,1]})
df3.index=['Y', 'W']

Code

df1.combine_first(df2).combine_first(df3)

Output:

     A    B    C    D    E
W  NaN  NaN  1.0  NaN  1.0
X  NaN  NaN  1.0  2.0  NaN
Y  1.0  2.0  0.0  NaN  1.0
Z  0.0  1.0  0.0  1.0  NaN

Upvotes: 1

Related Questions