00__00__00
00__00__00

Reputation: 5367

merge dataframes with timestamps and intervals

I have two dataframes.

df1contains number and timestamps. It is a very large set.

df1.head()
Out[292]: 
2016-08-31 08:09:00     1.0
2016-08-31 08:11:00     7.0
2016-08-31 08:14:00    90.0

df2 contains intervals

d2.head()
Out[292]:        
   start                 stop                         C
2016-08-31 08:09:00     2016-08-31 08:12:00         'a'
2016-08-31 08:13:00     2016-08-31 08:20:00         'b'
2016-08-31 08:20:00     2016-08-31 08:45:00         'c'

I would like to add a new column C to df1 such that the value of C is corresponding to the value in df2 for the interval which contained the index of df1.

Expected result

 df1.head()
    Out[292]:                      C
    2016-08-31 08:09:00     1.0   'a'
    2016-08-31 08:11:00     7.0   'a'
    2016-08-31 08:14:00    90.0   'b'

So far, I have tried:

 df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti<df2.stop)].C)

but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.

How to do this efficiently?

Upvotes: 2

Views: 444

Answers (1)

jezrael
jezrael

Reputation: 863531

Create IntervalIndex by IntervalIndex.from_arrays first:

s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00], 
               [2016-08-31 08:13:00, 2016-08-31 08:20:00],
               [2016-08-31 08:20:00, 2016-08-31 08:45:00]],
              closed='both',
              dtype='interval[datetime64[ns]]')

Then set_index by new IntervalIndex set to new column by array created by values:

df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

EDIT:

s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64), 
                                 df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000], 
               [1472631180000000000, 1472631600000000000], 
               [1472631600000000000, 1472633100000000000]],
              closed='both',
              dtype='interval[int64]')

df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

Upvotes: 5

Related Questions