Reputation: 5367
I have two dataframes.
df1contains number and timestamps. It is a very large set.
df1.head()
Out[292]:
2016-08-31 08:09:00 1.0
2016-08-31 08:11:00 7.0
2016-08-31 08:14:00 90.0
df2 contains intervals
d2.head()
Out[292]:
start stop C
2016-08-31 08:09:00 2016-08-31 08:12:00 'a'
2016-08-31 08:13:00 2016-08-31 08:20:00 'b'
2016-08-31 08:20:00 2016-08-31 08:45:00 'c'
I would like to add a new column C
to df1
such that the value of C
is corresponding to the value in df2
for the interval which contained the index of df1
.
Expected result
df1.head()
Out[292]: C
2016-08-31 08:09:00 1.0 'a'
2016-08-31 08:11:00 7.0 'a'
2016-08-31 08:14:00 90.0 'b'
So far, I have tried:
df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti<df2.stop)].C)
but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.
How to do this efficiently?
Upvotes: 2
Views: 444
Reputation: 863531
Create IntervalIndex
by IntervalIndex.from_arrays
first:
s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00],
[2016-08-31 08:13:00, 2016-08-31 08:20:00],
[2016-08-31 08:20:00, 2016-08-31 08:45:00]],
closed='both',
dtype='interval[datetime64[ns]]')
Then set_index
by new IntervalIndex
set to new column by array created by values
:
df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
A C
2016-08-31 08:09:00 1.0 a
2016-08-31 08:11:00 7.0 a
2016-08-31 08:14:00 90.0 b
EDIT:
s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64),
df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000],
[1472631180000000000, 1472631600000000000],
[1472631600000000000, 1472633100000000000]],
closed='both',
dtype='interval[int64]')
df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
A C
2016-08-31 08:09:00 1.0 a
2016-08-31 08:11:00 7.0 a
2016-08-31 08:14:00 90.0 b
Upvotes: 5