Sherwin R

Reputation: 111

Dataframe concatenation with same indexes and different column sizes

How can I concatenate a dictionary of dataframes that have different sets of columns? The index of each dataframe starts at 0, so the indices are duplicated across dataframes. Some dataframes are missing a few columns, while others contain all of them.

Here is an example:

df1

idx col1 col2 col3
0    1     1    1
1    2     2    2

df2

idx col1 col3 
0    1     1    
1    2     2    

df3

idx col1 col2 col3
0    1     1    1
1    2     2    2

desired_result

idx col1 col2 col3
 0    1     1    1
 1    2     2    2
 2    1     nan  1
 3    2     nan  2
 4    1     1    1
 5    2     2    2

So I would like to fill the missing columns with NaN and stack everything into a single dataframe that has all the columns.

Thanks in advance !

Upvotes: 0

Views: 224

Answers (1)

Gonçalo Peres

Reputation: 13582

Considering that the dictionary of dataframes looks like the following (here idx is modelled as an ordinary column)

import pandas as pd

df_dict = {
    'df1': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]}),
    'df2': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col3': [1, 2]}),
    'df3': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]})
}

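One can simply use pandas.concat as follows. Columns missing from a dataframe are filled with NaN, and ignore_index=True replaces the duplicated indices with a fresh 0..n-1 index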

df = pd.concat(df_dict.values(), ignore_index=True)

[Out]:

   idx  col1  col2  col3
0    1     1   1.0     1
1    2     2   2.0     2
2    1     1   NaN     1
3    2     2   NaN     2
4    1     1   1.0     1
5    2     2   2.0     2
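For the layout in the question, where idx is the default 0-based index rather than a column, the same call should apply. The following is a minimal sketch under that assumption; the `frames` dictionary and its contents are hypothetical stand-ins for the asker's data.

```python
import pandas as pd

# Hypothetical frames mirroring the question: each has the default
# 0-based index, and df2 lacks col2.
frames = {
    'df1': pd.DataFrame({'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]}),
    'df2': pd.DataFrame({'col1': [1, 2], 'col3': [1, 2]}),
    'df3': pd.DataFrame({'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]}),
}

# ignore_index=True discards the duplicated 0/1 labels and renumbers the rows;
# the default outer join keeps every column and fills the gaps with NaN.
result = pd.concat(frames.values(), ignore_index=True)
print(result)
```

The rows coming from df2 hold NaN in col2, which is also why that column is upcast to float64.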

If one wants the idx column to be the index, one can chain pandas.DataFrame.set_index as follows (note that the resulting index contains duplicate labels)

df = pd.concat(df_dict.values(), ignore_index=True).set_index('idx')

[Out]:

     col1  col2  col3
idx                  
1       1   1.0     1
2       2   2.0     2
1       1   NaN     1
2       2   NaN     2
1       1   1.0     1
2       2   2.0     2

Notes:

  • If one doesn't want col2 to be float64 (following this user's comment), which happens because NaN is a float value, one approach would be to use the nullable integer dtype pandas.Int64Dtype as follows

    df['col2'] = df['col2'].astype(pd.Int64Dtype())
    
    [Out]:
    
       idx  col1  col2  col3
    0    1     1     1     1
    1    2     2     2     2
    2    1     1  <NA>     1
    3    2     2  <NA>     2
    4    1     1     1     1
    5    2     2     2     2
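As a small self-contained check of the note above, the sketch below rebuilds the example data and applies the conversion using the string alias 'Int64', which refers to the same nullable dtype as pd.Int64Dtype(). The variable names are illustrative only.

```python
import pandas as pd

df_dict = {
    'df1': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]}),
    'df2': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col3': [1, 2]}),
    'df3': pd.DataFrame({'idx': [1, 2], 'col1': [1, 2], 'col2': [1, 2], 'col3': [1, 2]}),
}

df = pd.concat(df_dict.values(), ignore_index=True)

# After concat, col2 is float64 because of the NaN fill.
dtype_before = str(df['col2'].dtype)

# Cast to the nullable integer dtype; missing entries become pd.NA.
df['col2'] = df['col2'].astype('Int64')
dtype_after = str(df['col2'].dtype)

print(dtype_before, '->', dtype_after)
print(df)
```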
    

Upvotes: 1
