Reputation: 21400
I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start
and stop
column.
Consider the series, valid
, which is the column of a DataFrame called trials
. Here is what trials
Looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start 156 non-null values
stop 156 non-null values
valid 156 non-null values
dtypes: bool(1), float64(2)
I have a separate DataFrame called 'blink`. It has three columns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart 41 non-null values
tstop 41 non-null values
dur 41 non-null values
dtypes: bool(1), float64(2)
The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop
and tstart
.
I would like to set each row of trials['valid']
to False
if the interval between it's corresponding trials['start']
to trials['stop']
overlaps with any of the blink['tstart']
to blink['tstop']
intervals.
I could iterate through the rows and use np.arange
along with the in
operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?
If anything is unclear, I'll of course be happy to provide additional details.
Upvotes: 1
Views: 2440
Reputation: 128948
Your blink data
In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))
In [28]: blink_s = blink.stack()
In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]
Construct a series of of the blink (kind of like pivoting), but we need new names
In [37]: blink_s
Out[37]:
tstart_0 0
tstop_0 5
tstart_1 10
tstop_1 15
The trial data
In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))
Tile the blink_s across rows of the trial
In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])
Join em up
In [33]: x = trial.join(blink_df)
In [34]: x
Out[34]:
start stop tstart_0 tstop_0 tstart_1 tstop_1
0 3 4 0 5 10 15
1 7 10 0 5 10 15
2 12 16 0 5 10 15
Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)
In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))
In [36]: x
Out[36]:
start stop tstart_0 tstop_0 tstart_1 tstop_1 valid
0 3 4 0 5 10 15 True
1 7 10 0 5 10 15 False
2 12 16 0 5 10 15 False
This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin
.
In essence you are flattening the blink frame to a series that you then can apply to each of the trial
Using Isin (and OP data):
Convert to int64 data
trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')
Add in a row that we know is ni the range
trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)
Construct the range of values which we want to test
selections = []
for r in blink.iterrows():
e = r[1]
selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)
Return true if the passed start/stop are in the selection range
def f(s):
return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)
trial[trial.valid]
I inserted 1 row that I knew would pass, no other rows pass
start stop valid
156 1000 1200 True
Upvotes: 1
Reputation: 93754
Assume the size of trials
is m
and the size of blink
is n
.
First, sort the blink
by tstart
before you check each row in trials
, merge the overlapped ones, it takes O(n log n)
, take a look at this
While you check whether a start/stop pair is valid or not, follow the following algorithm.
tstarts
tstarts
tstart
between start and stop, return Truetstop
,
if they overlap with the start/stop pair, return TrueAbove algorithm might help you reduce the time complexity to check whether a start/stop
pair overlaps with any pair in blink
or not from O(n)
to O(log n)
where n is the length of blink
The time complexity will reduce from O(mn)
to O(m log n) + O(n log n)
. If m >> log n
and n
is large, that might help you a lot.
Upvotes: 0