Filter a dataframe column based on values in a second column being within a tolerance value of any rows in a second dataframe

Question

I am dealing with experimental measurements of time-correlated gamma-ray emissions with a pair of detectors. I have a long-form dataframe which lists every gamma-ray detected and displays its energy, the timestamp of the event, and the detector channel. Here is the sample structure of this dataframe:

df
    Energy  Timestamp       Channel
0   639     753437128196030 1
1   798     753437128196010 2
2   314     753437131148580 1
3   593     753437131148510 2
4   2341    753437133607800 1

I must filter these data according to the following conditions: return the energies of all events detected in Channel 1 that occur within one user-selectable timing_window of events detected in Channel 2. Furthermore, only the events in Channel 2 with energies in the range [E_lo, E_hi] should be considered when evaluating the timing-window condition.

So far, I have tried the following:

Separate the energy data of each detector into individual dataframes:

d1_all = df[(df["Channel"] == 1)]
d2_all = df[(df["Channel"] == 2)]

Reset the indices of d1_all:

d1_all = d1_all.reset_index()
d1_all = d1_all.drop(['index'], axis=1)
d1_all.head()

Retain only the events in d2_all whose energies fall in the range [E_lo=300, E_hi=600]:

d2_gate = d2_all[(d2_all["Energy"] >= 300) & (d2_all["Energy"] <=600)]

Reset the indices of d2_gate:

d2_gate = d2_gate.reset_index()
d2_gate = d2_gate.drop(['index'], axis=1)
d2_gate.head()

Everything up to this point works fine. Here is the biggest problem. The following code evaluates each event in detector 1 to determine whether its timestamp is within one timing_window (called coin_window in the code) of the timestamp of ANY event in detector 2 within the energy range E_lo to E_hi. The problem is that this dataframe can have on the order of tens to hundreds of thousands of entries for each detector, and the current code, which uses nested for loops, takes essentially forever to run.

# Record one keep/drop flag per row and filter once at the end;
# dropping rows inside the loop would shift the positions used by iloc.
keep = []
for i in range(d1_all.shape[0]):
    t1 = d1_all.iloc[i]["Timestamp"]
    coincidence = False
    for j in range(d2_gate.shape[0]):
        t2 = d2_gate.iloc[j]["Timestamp"]
        if t2 - coin_window <= t1 <= t2 + coin_window:
            coincidence = True
            break
    keep.append(coincidence)
d1_all = d1_all[keep]

Any help identifying a faster implementation of evaluating for coincidences would be greatly appreciated! Thank you!

Alexander · Accepted Answer

Perhaps this will work? As you have done, it first splits the data into two dataframes corresponding to the channel. The second dataframe also filters for the energy between the max and min levels.

I then create NumPy arrays of start and end times, corresponding to the timestamps in df2 -/+ the window.

import numpy as np
import pandas as pd

window = 10000  # For example.
min_energy = 300
max_energy = 600

df1 = df[df['Channel'].eq(1)]
df2 = df.loc[df['Channel'].eq(2) 
             & df['Energy'].ge(min_energy) 
             & df['Energy'].le(max_energy)]


start = np.array(df2['Timestamp'] - window)
end = np.array(df2['Timestamp'] + window)

df1[df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())]

To explain the lambda function, I'll provide the following sample data (the timestamps are small numbers to make them more readable):

df = pd.DataFrame({
    'Energy': [639, 798, 314, 593, 2341, 550, 625],
    'Timestamp': [10, 20, 28, 30, 40, 50, 51],
    'Channel': [1, 2, 1, 2, 1, 2, 1]
})

After applying the code above:

>>> df1
   Energy  Timestamp  Channel
0     639         10        1
2     314         28        1
4    2341         40        1
6     625         51        1

>>> df2
   Energy  Timestamp  Channel
3     593         30        2
5     550         50        2

I use a window of 3 for this example, which gives the following start and end times based on the timestamps from df2 -/+ the window.

window = 3

>>> start
array([27, 47])

>>> end
array([33, 53])

Now let's look at the result from applying the first part of the lambda expression. For each timestamp in df1, it provides a boolean array indicating whether that timestamp is greater than or equal to each start time derived from the timestamps in df2.

>>> df1['Timestamp'].apply(lambda ts: (start <= ts))
0    [False, False]  # 27 <= 10, 47 <= 10
2     [True, False]  # 27 <= 28, 47 <= 28
4     [True, False]  # 27 <= 40, 47 <= 40
6      [True, True]  # 27 <= 51, 47 <= 51
Name: Timestamp, dtype: object

We then look at the second part of the lambda expression using the same logic.

>>> df1['Timestamp'].apply(lambda ts: (ts <= end))
0     [True, True]  # 10 <= 33, 10 <= 53
2     [True, True]  # 28 <= 33, 28 <= 53
4    [False, True]  # 40 <= 33, 40 <= 53
6    [False, True]  # 51 <= 33, 51 <= 53
Name: Timestamp, dtype: object

We then combine the two results element-wise using the & operator.

>>> df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)))
0    [False, False]  # False & True, False & True <=> (27 <= 10) & (10 <= 33), (47 <= 10) & (10 <= 53)
2     [True, False]  # True & True, False & True
4    [False, False]  # True & False, False & True
6     [False, True]  # True & False, True & True
Name: Timestamp, dtype: object
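
As an aside, the per-row arrays above are the rows of a single 2-D boolean matrix, which NumPy broadcasting can build in one expression. This is only a sketch on the sample values, not part of the original answer; the names ts1 and in_window are mine.

```python
import numpy as np

# Channel 1 timestamps and the window edges from the sample data above.
ts1 = np.array([10, 28, 40, 51])
start = np.array([27, 47])
end = np.array([33, 53])

# ts1[:, None] has shape (4, 1); start and end have shape (2,).
# Broadcasting compares every df1 timestamp with every df2 window,
# giving one row per df1 event and one column per df2 window.
in_window = (start <= ts1[:, None]) & (ts1[:, None] <= end)
print(in_window)
```

The matrix holds len(df1) x len(df2) booleans, so this form is probably only practical when at least one of the two sets is small.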

Given that we are looking for any event from df1 that falls within any window from df2, we apply .any() to our result from above to create the boolean mask.

>>> df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())
0    False
2     True
4    False
6     True
Name: Timestamp, dtype: bool

Which results in the following selected events:

>>> df1[df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())]
   Energy  Timestamp  Channel
2     314         28        1
6     625         51        1

The timestamp of 28 from the first event falls within the window from the first event in df2, i.e. 30 -/+ 3.

The timestamp of 51 from the second event falls within the window from the other event in df2, i.e. 50 -/+ 3.
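
The apply call still compares every df1 timestamp against every df2 window, so for hundreds of thousands of events per channel it may remain slow. One possible alternative, which is a different technique from the lambda above, is a binary search with np.searchsorted over the sorted df2 timestamps. The sketch below assumes integer timestamps and the same inclusive window; the helper name coincident_mask is hypothetical.

```python
import numpy as np
import pandas as pd

def coincident_mask(ts1, ts2, window):
    """Hypothetical helper: True where a ts1 value lies within
    +/- window (inclusive) of at least one ts2 value."""
    ts1 = np.asarray(ts1)
    ts2 = np.sort(np.asarray(ts2))
    if ts2.size == 0:
        return np.zeros(ts1.shape, dtype=bool)
    # Position of the first gate timestamp that is >= ts - window.
    idx = np.searchsorted(ts2, ts1 - window, side="left")
    found = idx < ts2.size
    # Clip idx so the lookup is always valid; rows with no such
    # gate timestamp are excluded through `found`.
    candidate = ts2[np.minimum(idx, ts2.size - 1)]
    # A coincidence exists if that timestamp is also <= ts + window.
    return found & (candidate <= ts1 + window)

df = pd.DataFrame({
    'Energy': [639, 798, 314, 593, 2341, 550, 625],
    'Timestamp': [10, 20, 28, 30, 40, 50, 51],
    'Channel': [1, 2, 1, 2, 1, 2, 1]
})
df1 = df[df['Channel'].eq(1)]
df2 = df[df['Channel'].eq(2) & df['Energy'].between(300, 600)]

mask = coincident_mask(df1['Timestamp'], df2['Timestamp'], window=3)
selected = df1[mask]
print(selected)
```

On the sample data this selects the same two events (timestamps 28 and 51). The cost is one sort of the df2 timestamps plus one binary search per df1 event, instead of one comparison per pair.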
