Louis Thibault
Louis Thibault

Reputation: 21400

How can I check if the values in a series are contained in any of the intervals defined the rows of a DataFrame?

I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start and stop column.

Consider the series, valid, which is the column of a DataFrame called trials. Here is what trials Looks like:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start    156  non-null values
stop     156  non-null values
valid    156  non-null values
dtypes: bool(1), float64(2)

I have a separate DataFrame called 'blink`. It has three columns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart    41  non-null values
tstop     41  non-null values
dur       41  non-null values
dtypes: bool(1), float64(2)

The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop and tstart.

I would like to set each row of trials['valid'] to False if the interval between it's corresponding trials['start'] to trials['stop'] overlaps with any of the blink['tstart'] to blink['tstop'] intervals.

I could iterate through the rows and use np.arange along with the in operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?

If anything is unclear, I'll of course be happy to provide additional details.

Upvotes: 1

Views: 2440

Answers (2)

Jeff
Jeff

Reputation: 128948

Your blink data

In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))

In [28]: blink_s = blink.stack()

In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]

Construct a series of of the blink (kind of like pivoting), but we need new names

In [37]: blink_s
Out[37]: 
tstart_0     0
tstop_0      5
tstart_1    10
tstop_1     15

The trial data

In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))

Tile the blink_s across rows of the trial

In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])

Join em up

In [33]: x = trial.join(blink_df)

In [34]: x
Out[34]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1
0      3     4         0        5        10       15
1      7    10         0        5        10       15
2     12    16         0        5        10       15

Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)

In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))

In [36]: x
Out[36]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1  valid
0      3     4         0        5        10       15   True
1      7    10         0        5        10       15  False
2     12    16         0        5        10       15  False

This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin.

In essence you are flattening the blink frame to a series that you then can apply to each of the trial

Using Isin (and OP data):

Convert to int64 data

trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')

Add in a row that we know is ni the range

trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)

Construct the range of values which we want to test

selections = []
for r in blink.iterrows():
    e = r[1]
    selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)

Return true if the passed start/stop are in the selection range

def f(s):
    return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)

trial[trial.valid]

I inserted 1 row that I knew would pass, no other rows pass

     start  stop valid
156   1000  1200  True

Upvotes: 1

waitingkuo
waitingkuo

Reputation: 93754

Assume the size of trials is m and the size of blink is n.

First, sort the blink by tstart before you check each row in trials, merge the overlapped ones, it takes O(n log n), take a look at this

While you check whether a start/stop pair is valid or not, follow the following algorithm.

  1. Use the binary search to insert the start in the sorted tstarts
  2. Use the binary search to insert the stop in the sorted tstarts
  3. If there's any tstart between start and stop, return True
  4. Find the the one just before the start, find its tstop, if they overlap with the start/stop pair, return True
  5. Return False

Above algorithm might help you reduce the time complexity to check whether a start/stop pair overlaps with any pair in blink or not from O(n) to O(log n) where n is the length of blink

The time complexity will reduce from O(mn) to O(m log n) + O(n log n). If m >> log n and n is large, that might help you a lot.

Upvotes: 0

Related Questions