How can I check if the values in a series are contained in any of the intervals defined the rows of a DataFrame?

Question

I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start and stop column.

Consider the series, valid, which is the column of a DataFrame called trials. Here is what trials Looks like:


Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start    156  non-null values
stop     156  non-null values
valid    156  non-null values
dtypes: bool(1), float64(2)

I have a separate DataFrame called 'blink`. It has three columns:


Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart    41  non-null values
tstop     41  non-null values
dur       41  non-null values
dtypes: bool(1), float64(2)

The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop and tstart.

I would like to set each row of trials['valid'] to False if the interval between it's corresponding trials['start'] to trials['stop'] overlaps with any of the blink['tstart'] to blink['tstop'] intervals.

I could iterate through the rows and use np.arange along with the in operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?

If anything is unclear, I'll of course be happy to provide additional details.

Jeff · Accepted Answer

Your blink data

In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))

In [28]: blink_s = blink.stack()

In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]

Construct a series of of the blink (kind of like pivoting), but we need new names

In [37]: blink_s
Out[37]: 
tstart_0     0
tstop_0      5
tstart_1    10
tstop_1     15

The trial data

In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))

Tile the blink_s across rows of the trial

In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])

Join em up

In [33]: x = trial.join(blink_df)

In [34]: x
Out[34]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1
0      3     4         0        5        10       15
1      7    10         0        5        10       15
2     12    16         0        5        10       15

Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)

In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))

In [36]: x
Out[36]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1  valid
0      3     4         0        5        10       15   True
1      7    10         0        5        10       15  False
2     12    16         0        5        10       15  False

This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin.

In essence you are flattening the blink frame to a series that you then can apply to each of the trial

Using Isin (and OP data):

Convert to int64 data

trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')

Add in a row that we know is ni the range

trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)

Construct the range of values which we want to test

selections = []
for r in blink.iterrows():
    e = r[1]
    selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)

Return true if the passed start/stop are in the selection range

def f(s):
    return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)

trial[trial.valid]

I inserted 1 row that I knew would pass, no other rows pass

     start  stop valid
156   1000  1200  True

How can I check if the values in a series are contained in any of the intervals defined the rows of a DataFrame?

Answers (2)

Related Questions