Reputation: 4954
I am looking for an efficient way to combine 100 pandas data frames, which represent a grid of information points. The points in each data frame are unique and do not overlap the points in any other frame, but they share columns and rows across a larger patchwork space, i.e.:
    1    2    3    4    5    6    7    8    9
A  df1, df1, df1, df2, df2, df2, df3, df3, df3
B  df1, df1, df1, df2, df2, df2, df3, df3, df3
C  df1, df1, df1, df2, df2, df2, df3, df3, df3
D  df4, df4, df4, df5, df5, df5, etc, etc, etc
E  df4, df4, df4, df5, df5, df5, etc, etc, etc
F  df4, df4, df4, df5, df5, df5, etc, etc, etc
Pandas' concat only combines along either the column axis or the row axis, not both at once. So I've been iterating over the data frames and using the df1.combine_first(df2) method (repeated ad infinitum).
Is this the best way to proceed, or is there another more efficient method that I should be aware of?
Upvotes: 1
Views: 3096
Reputation: 30424
Here's a quick look at both the convenience and the efficiency angles, assuming non-overlapping data points and very regular data (everything 3x3 in this case).
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123'))
df2 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('123'))
df3 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('456'))
df4 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456'))
The combine_first way has the advantage that you can just dump everything in a list without worrying about the order:
%%timeit
comb_df = pd.DataFrame()
for df in [df1, df2, df3, df4]:
    comb_df = comb_df.combine_first(df)
100 loops, best of 3: 8.92 ms per loop
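As an aside, the same fold can be written more compactly with functools.reduce; this is just a stylistic variant of the loop above, not a speedup:

from functools import reduce

# fold combine_first over the list, starting from an empty frame
comb_df = reduce(lambda acc, df: acc.combine_first(df), [df1, df2, df3, df4], pd.DataFrame())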
The concat way requires you to group things in a specific order, but is more than twice as fast:
%%timeit
df5 = pd.concat([df1, df2], axis=0)  # rows A-F, columns 1-3
df6 = pd.concat([df3, df4], axis=0)  # rows A-F, columns 4-6
df7 = pd.concat([df5, df6], axis=1)  # join the two column blocks side by side
100 loops, best of 3: 3.84 ms per loop
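To scale this up to the 100 frames in the question, the same two-level concat works if the frames can be arranged in a row-major list of lists. A minimal sketch, assuming a hypothetical grid variable that holds the frames in their patchwork positions:

# grid[i][j] is the frame at row-block i, column-block j of the patchwork
grid = [[df1, df3],
        [df2, df4]]

# stitch each row of blocks side by side, then stack the strips vertically
full = pd.concat([pd.concat(row, axis=1) for row in grid], axis=0)

For the four frames above this reproduces df7 exactly (full.equals(df7) is True); with 100 frames, grid would simply be 10x10.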
Quick check that both ways give the same result:
(comb_df == df7).all().all()
True
Upvotes: 2