merge dataframes with timestamps and intervals

Question

I have two dataframes.

df1contains number and timestamps. It is a very large set.

df1.head()
Out[292]: 
2016-08-31 08:09:00     1.0
2016-08-31 08:11:00     7.0
2016-08-31 08:14:00    90.0

df2 contains intervals

d2.head()
Out[292]:        
   start                 stop                         C
2016-08-31 08:09:00     2016-08-31 08:12:00         'a'
2016-08-31 08:13:00     2016-08-31 08:20:00         'b'
2016-08-31 08:20:00     2016-08-31 08:45:00         'c'

I would like to add a new column C to df1 such that the value of C is corresponding to the value in df2 for the interval which contained the index of df1.

Expected result

 df1.head()
    Out[292]:                      C
    2016-08-31 08:09:00     1.0   'a'
    2016-08-31 08:11:00     7.0   'a'
    2016-08-31 08:14:00    90.0   'b'

So far, I have tried:

 df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti



but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.

How to do this efficiently?

jezrael · Accepted Answer

Create IntervalIndex by IntervalIndex.from_arrays first:

s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00], 
               [2016-08-31 08:13:00, 2016-08-31 08:20:00],
               [2016-08-31 08:20:00, 2016-08-31 08:45:00]],
              closed='both',
              dtype='interval[datetime64[ns]]')

Then set_index by new IntervalIndex set to new column by array created by values:

df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

EDIT:

s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64), 
                                 df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000], 
               [1472631180000000000, 1472631600000000000], 
               [1472631600000000000, 1472633100000000000]],
              closed='both',
              dtype='interval[int64]')

df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

merge dataframes with timestamps and intervals

Answers (1)

Related Questions