benjaminmgross
benjaminmgross

Reputation: 2102

pandas concat DataFrame on different Index

General Issue

I have an arbitrary list of pandas.DataFrame's (let's use 2 to keep the example clear), and I want to concat them on an Index that:

  1. is neither the inner nor the outer join of the existing DataFrames
  2. is a different, separate Index, but only has dates within all the DataFrame's

For example, take the following 2 DataFrame's (note the difference in Index shapes):

In [01]: d1 = pandas.DataFrame( numpy.random.randn(15, 4), 
                                columns = ['a', 'b', 'c', 'd'], 
                                index = pandas.DatetimeIndex(start = '01/01/2001', 
                                                             freq = 'b', 
                                                             periods = 15)
          )

In [02]: d2 = pandas.DataFrame( numpy.random.randn(17, 4), 
                                columns = ['e', 'f', 'g', 'h'], 
                                index = pandas.DatetimeIndex(start = '01/05/2001', 
                                                             freq = 'b', 
                                                             periods = 17)
          )

I would like to join these two DataFrame's on an intersecting Index, such my_index, constructed here:

In [03]: ind = range(0, 10, 2)
In [04]: my_index = d2.index[ind].copy()

So the following result should have the same results as:

In [05]: d1.loc[my_index, :].join(d2.loc[my_index, :] )
Out[65]: 
               a         b         c         d         e         f  \
2001-01-05  1.702556 -0.885554  0.766257 -0.731700 -1.071232  1.806680   
2001-01-09 -0.968689 -0.700311  1.024988 -0.705764  0.804285 -0.337177   
2001-01-11  1.249893 -0.613356  1.975736 -0.093838  0.428004  0.634204   
2001-01-15  0.430000  0.502100  0.194092  0.588685 -0.507332  1.404635   
2001-01-17  1.005721  0.604771 -2.296667  0.157201  1.583537  1.359332   

               g         h  
2001-01-05 -1.183528  1.260880  
2001-01-09  0.352487  0.700853  
2001-01-11  1.060694  0.040667  
2001-01-15 -0.044510  0.565152  
2001-01-17 -0.731624 -0.331027  

Personal Considerations

Because this is for a larger application, and I will have an arbitrary number of DataFrame's I'd like to:

  1. Use existing pandas functionality instead of building my own hack, i.e. reduce( map ( ) ) etc.
  2. Return views of the intersection of the DataFrame's instead of creating copies of the DataFrame's

Upvotes: 1

Views: 2647

Answers (2)

benjaminmgross
benjaminmgross

Reputation: 2102

Different Methods & Their Times (for Completeness)

I've accepted @unutbu's answer, but I thought it might be valuable to show the two functions I created (and @unutbu's) and their different %timeitvalues in case anyone wants to use it:

Create the df_list and my_index:

dfs = []
for i in range(5):
    tmp = pandas.DataFrame( numpy.random.randn(1000, 4), 
                            columns = list(itertools.islice(columns, 4)), 
                            index = pandas.DatetimeIndex(start = '01/01/2000', 
                                                         freq = 'b', 
                                                         periods = 1000)
    )

    dfs.append(tmp)

ind = range(0, 1000, 2)
my_index = tmp.index[ind].copy()

3 Different Implementations

def join_on_index_a(df_list, index):
    return pandas.concat( 
                          map( lambda x: x.reindex(index), df_list), 
                          axis = 1
    )

#@unutbu's implementation
def join_on_index_b(df_list, index):
    result = dfs[0].reindex(index)
    for df in dfs[1:]:
        result = result.join(df, how='inner')
    return result

def join_on_index_c(df_list, index):
    return pandas.concat( map( lambda x: x.loc[index, :], df_list), axis = 1)

The Results Using iPython %timeit

In [49]: %timeit join_on_index_a(dfs, my_index)
1000 loops, best of 3: 1.85 ms per loop

In [50]: %timeit join_on_index_b(dfs, my_index)
100 loops, best of 3: 1.94 ms per loop

In [51]: %timeit join_on_index_c(dfs, my_index)
100 loops, best of 3: 21.5 ms per loop

Upvotes: 1

unutbu
unutbu

Reputation: 879291

I don't think there is an out-of-the-box Pandas function for doing this. However, it's not hard to build your own:

def select_join(dfs, index):
    result = dfs[0].reindex(index)
    for df in dfs[1:]:
        result = result.join(df, how='inner')
    return result

For example,

import numpy as np
import pandas as pd
import string
import itertools as IT

columns = iter(string.letters)
dfs = []
for i in range(3):
    d1 = pd.DataFrame( np.random.randn(15, 4), 
                           columns = list(IT.islice(columns, 4)), 
                           index = pd.DatetimeIndex(start = '01/01/2001', 
                                                    freq = 'b', 
                                                    periods = 15))
    dfs.append(d1)

ind = range(0, 10, 2)
my_index = d1.index[ind].copy()
print(select_join(dfs, my_index))

yields

                   a         b         c         d         e         f  \
2001-01-01  0.228430 -1.154375 -0.612703 -2.760826 -0.877355 -0.071581   
2001-01-03  1.452750  1.341027  0.051486  1.231563  0.428353  1.320172   
2001-01-05 -0.966979 -1.997200 -0.376060 -0.692346 -1.689897  0.549653   
2001-01-09 -0.117443 -0.888103  2.092829 -0.467220 -1.083004 -1.443015   
2001-01-11 -0.168980 -0.152663  0.365618  0.444175 -1.472091 -0.578182   

                   g         h         i         j         k         l  
2001-01-01 -0.098758  0.920457 -1.072377 -0.627720  0.223060  0.903130  
2001-01-03  1.962124  1.134501 -0.209813 -2.309090  0.358121  0.655156  
2001-01-05  1.088195 -1.705393 -0.161167 -0.339617  0.945495  0.220701  
2001-01-09  0.970829  1.931192  0.943150 -1.895580  0.815188 -1.485206  
2001-01-11  0.747193 -1.221069 -0.164531 -0.395197 -0.754051  0.922090  

Regarding the second consideration: It is impossible to return a view if index is arbitrary. The DataFrame stores data (of like dtype) in a NumPy array. When you select arbitrary rows from a NumPy array, space for a new array is allocated and the rows are copied from the original array into the new array. Only when the selection can be expressed as a basic slice is a view returned. This limitation of NumPy -- a very hard limitation to remove! -- bubbles up into Pandas, causing DataFrames to return copies when the index is not expressible as a basic slice.

Upvotes: 2

Related Questions