mati

Reputation: 1145

How to merge/join/concat dataframes on index in pandas

I have a main dataframe df1 which is 'empty' (all NaN) and some other dataframes of different lengths but with the same columns as df1. I would like to merge/join/concat df2 and df3 into df1 at certain index positions:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=np.arange(20), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'])
df3 = pd.DataFrame(np.random.randn(2, 4), columns=['A', 'B', 'C', 'D'])
df4 = pd.concat([df1, df2, df3], axis=0, keys=['', 'df2', 'df3'])

However, concat([df1,df2,df3], axis=0, keys....) just stacks the dfs one after another...

As an example, df2 should start at df1.index[3] and df3 at df1.index[12], leaving the NaNs of df1 before, between and after df2 and df3 in place.

Upvotes: 0

Views: 1153

Answers (1)

EdChum

Reputation: 394041

You won't be able to achieve what you want using concat like this, or merge for that matter, without changing the indexes first. By default concat keeps each frame's original index values and simply stacks the dfs one after another. If you tried to merge on the indices, the identically named columns would clash and create additional suffixed columns, e.g. 'X_x', 'X_y', etc.
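
To see both behaviours concretely, here is a minimal sketch with frames shaped like the ones in the question. The names `stacked` and `merged` are illustrative only, the data is random, and df1 is built as a float NaN frame here (an assumption, the question's df1 is object dtype):

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame(np.nan, index=np.arange(20), columns=cols)  # float NaN placeholder
df2 = pd.DataFrame(np.random.randn(5, 4), columns=cols)        # default index 0..4

# concat keeps each frame's own index labels and stacks the rows.
stacked = pd.concat([df1, df2], axis=0)
print(stacked.shape)                # (25, 4)
print(stacked.index[-5:].tolist())  # [0, 1, 2, 3, 4]

# An index-on-index merge keeps both sets of same-named columns, suffixed.
merged = df1.merge(df2, left_index=True, right_index=True, how='left')
print(merged.columns.tolist())
```

Neither result places df2 at rows 3 to 7 of a single set of columns, which is why the index has to be changed first.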

The easiest way is to give df2 and df3 the index values of the rows they should occupy and then call update, which aligns on the index and overwrites the NaNs at those rows:

In [55]:

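df1 = pd.DataFrame(index=np.arange(20), columns=['A', 'B', 'C', 'D'])
# give df2 and df3 the index labels of the rows they should fill
df2 = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'], index=np.arange(3, 8))
df3 = pd.DataFrame(np.random.randn(2, 4), columns=['A', 'B', 'C', 'D'], index=np.arange(12, 14))
# update modifies df1 in place, aligning on index and columns
df1.update(df2)
df1.update(df3)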
df1
Out[55]:
            A          B          C          D
0         NaN        NaN        NaN        NaN
1         NaN        NaN        NaN        NaN
2         NaN        NaN        NaN        NaN
3    1.315786 -0.4132373   1.084278  -0.945918
4    2.663868  0.8093393 -0.9853748  -1.428507
5   0.1414072  0.1896291   1.562247  0.1180398
6    1.699411   1.518535 -0.1854568  -0.430911
7   -1.716514   1.053428  0.3587436   1.471504
8         NaN        NaN        NaN        NaN
9         NaN        NaN        NaN        NaN
10        NaN        NaN        NaN        NaN
11        NaN        NaN        NaN        NaN
12 -0.5668472  -0.137524  0.1769178  0.7065836
13  -1.726548  0.2645793  0.1484438  -1.099373
14        NaN        NaN        NaN        NaN
15        NaN        NaN        NaN        NaN
16        NaN        NaN        NaN        NaN
17        NaN        NaN        NaN        NaN
18        NaN        NaN        NaN        NaN
19        NaN        NaN        NaN        NaN
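
Because df1 holds nothing but NaN in this case, a different route (not the update approach shown above) may also work: concatenate the already-labelled pieces and reindex to the full range. This is a sketch under that assumption, and the name `result` is illustrative:

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D']
df2 = pd.DataFrame(np.random.randn(5, 4), columns=cols, index=np.arange(3, 8))
df3 = pd.DataFrame(np.random.randn(2, 4), columns=cols, index=np.arange(12, 14))

# Stack the labelled pieces, then expand to the full 0..19 index;
# labels that are absent from the pieces become all-NaN rows.
result = pd.concat([df2, df3]).reindex(np.arange(20))
print(result.isna().all(axis=1).sum())  # 13 all-NaN rows
```

This relies on the index ranges of the pieces not overlapping, since reindex raises on duplicate labels. If df1 already contained data worth keeping, update would be the safer choice.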

If the dfs already exist, you can set the index after creation like this:

In [56]:

df2 = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'])
print(df2)
# relabel the five rows as 3..7
df2.index = np.arange(3, 8)
df2
          A         B         C         D
0 -0.417199  0.184960 -1.056915  0.674005
1  0.592654  1.131421 -0.958991  0.831648
2 -1.087500  0.664596 -1.508290  2.453404
3  1.441886 -0.131998  0.040513 -2.675875
4  1.547340 -0.230203  1.335917 -1.568939
Out[56]:
          A         B         C         D
3 -0.417199  0.184960 -1.056915  0.674005
4  0.592654  1.131421 -0.958991  0.831648
5 -1.087500  0.664596 -1.508290  2.453404
6  1.441886 -0.131998  0.040513 -2.675875
7  1.547340 -0.230203  1.335917 -1.568939
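
The two steps (relabel, then update) can be wrapped up as a small function. `place_at` below is a hypothetical helper, not part of pandas, and df1 is created as a float NaN frame so its columns stay numeric (an assumption, not what the snippets above do):

```python
import numpy as np
import pandas as pd

def place_at(target, block, start):
    """Hypothetical helper: write `block` into `target` in place,
    beginning at positional row `start`."""
    shifted = block.copy()
    # Borrow the labels of the destination rows so update() can align on them.
    shifted.index = target.index[start:start + len(block)]
    target.update(shifted)

cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame(np.nan, index=np.arange(20), columns=cols)
df2 = pd.DataFrame(np.random.randn(5, 4), columns=cols)
df3 = pd.DataFrame(np.random.randn(2, 4), columns=cols)

place_at(df1, df2, 3)    # fills rows 3..7
place_at(df1, df3, 12)   # fills rows 12..13
print(df1)
```

Taking the labels from `target.index` means the helper does not depend on df1 having a 0..n integer index. Note that update only copies non-NaN values, so NaNs inside a block leave the corresponding cells of the target untouched, and a block that runs past the end of the target raises a length-mismatch error when the index is assigned.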

Upvotes: 1
