Reputation: 45
I am dealing with experimental measurements of time-correlated gamma-ray emissions with a pair of detectors. I have a long-form dataframe which lists every gamma-ray detected and displays its energy, the timestamp of the event, and the detector channel. Here is the sample structure of this dataframe:
df
   Energy        Timestamp  Channel
0     639  753437128196030        1
1     798  753437128196010        2
2     314  753437131148580        1
3     593  753437131148510        2
4    2341  753437133607800        1
I must filter these data according to the following conditions:
Return the energies of all events detected in Channel 1 which occur within one user-selectable timing_window of events detected in Channel 2.
Furthermore, only the events in Channel 2 that are within the energy range [E_lo, E_hi] should be considered when evaluating the timing window conditions.
So far, I have tried the following:
Separate the energy data of each detector into individual dataframes:
d1_all = df[(df["Channel"] == 1)]
d2_all = df[(df["Channel"] == 2)]
Reset the indices of d1_all:
d1_all = d1_all.reset_index()
d1_all = d1_all.drop(['index'], axis=1)
d1_all.head()
Retain only the events in d2_all which occur in the range [E_lo=300, E_hi=600]:
d2_gate = d2_all[(d2_all["Energy"] >= 300) & (d2_all["Energy"] <=600)]
Reset the indices of d2_gate:
d2_gate = d2_gate.reset_index()
d2_gate = d2_gate.drop(['index'], axis=1)
d2_gate.head()
Everything up to this point works fine. Here is the biggest problem. The following code evaluates each event in detector 1 to determine if its timestamp is within one timing_window (coin_window in the code) of the timestamp corresponding to ANY event within the energy range E_lo to E_hi in detector 2. The problem is that this dataframe can have on the order of tens to hundreds of thousands of entries for each detector, and the current code, which uses nested for loops, takes essentially forever to run.
for i in range(0, d1_all.shape[0]):
    coincidence = False
    for j in range(0, d2_gate.shape[0]):
        if ((d1_all.iloc[i]["Timestamp"] >= d2_gate.iloc[j]["Timestamp"] - coin_window)
                and (d1_all.iloc[i]["Timestamp"] <= d2_gate.iloc[j]["Timestamp"] + coin_window)):
            coincidence = True
            break
        else:
            pass
    if coincidence == True:
        pass
    elif coincidence == False:
        d1_all = d1_all.drop([i])
Any help identifying a faster implementation of evaluating for coincidences would be greatly appreciated! Thank you!
Upvotes: 2
Views: 280
Reputation: 109546
Perhaps this will work? As you have done, it first splits the data into two dataframes corresponding to the channel. The second dataframe also filters for the energy between the max and min levels.
I then create numpy arrays of start and end times, corresponding to the timestamps in df2 -/+ the window.
import numpy as np
import pandas as pd

window = 10000  # For example.
min_energy = 300
max_energy = 600
df1 = df[df['Channel'].eq(1)]
df2 = df.loc[df['Channel'].eq(2)
             & df['Energy'].ge(min_energy)
             & df['Energy'].le(max_energy)]
start = np.array(df2['Timestamp'] - window)
end = np.array(df2['Timestamp'] + window)
df1[df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())]
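If the apply call is still too slow for hundreds of thousands of events, a sorted lookup may help, since each Channel 1 timestamp then needs only a binary search rather than a comparison against every window. The following is a minimal sketch of that alternative, not a tested drop-in: it swaps the lambda for np.searchsorted, and the function name coincident_channel1 and its parameters are hypothetical names chosen for illustration. It assumes integer timestamps and inclusive energy and timing bounds, as in the code above.

```python
import numpy as np
import pandas as pd


def coincident_channel1(df, window, e_lo, e_hi):
    """Return Channel 1 rows within +/- window of any gated Channel 2 event.

    Hypothetical helper: the name and signature are illustrative only.
    """
    ch1 = df[df["Channel"].eq(1)]
    # Series.between is inclusive on both ends by default.
    gate = df[df["Channel"].eq(2) & df["Energy"].between(e_lo, e_hi)]

    t2 = np.sort(gate["Timestamp"].to_numpy())
    if t2.size == 0:
        return ch1.iloc[0:0]  # no gated events, so nothing can coincide

    t1 = ch1["Timestamp"].to_numpy()
    # Position of the first gated timestamp that is >= t1 - window.
    idx = np.searchsorted(t2, t1 - window, side="left")
    has_candidate = idx < t2.size
    # Clip the index so the lookup is always valid; rows without a
    # candidate are masked out by has_candidate.
    candidate = t2[np.minimum(idx, t2.size - 1)]
    mask = has_candidate & (candidate <= t1 + window)
    return ch1[mask]


# Small made-up dataset with short timestamps, for illustration only.
df = pd.DataFrame({
    "Energy": [639, 798, 314, 593, 2341, 550, 625],
    "Timestamp": [10, 20, 28, 30, 40, 50, 51],
    "Channel": [1, 2, 1, 2, 1, 2, 1],
})
result = coincident_channel1(df, window=3, e_lo=300, e_hi=600)
```

The gated timestamps are sorted once, so the data does not need to arrive in time order.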
To explain the lambda function, I'll provide the following sample data (the timestamps are shortened to make them more readable):
df = pd.DataFrame({
'Energy': [639, 798, 314, 593, 2341, 550, 625],
'Timestamp': [10, 20, 28, 30, 40, 50, 51],
'Channel': [1, 2, 1, 2, 1, 2, 1]
})
After applying the code above:
>>> df1
Energy Timestamp Channel
0 639 10 1
2 314 28 1
4 2341 40 1
6 625 51 1
>>> df2
Energy Timestamp Channel
3 593 30 2
5 550 50 2
I use a window of 3 for this example, which gives the following start and end times based on the timestamps from df2 -/+ the window.
window = 3
>>> start
array([27, 47])
>>> end
array([33, 53])
Now let's look at the result from applying the first part of the lambda expression. For each timestamp in df1, it provides a boolean array indicating whether that timestamp is greater than or equal to each start time based on the timestamps in df2.
>>> df1['Timestamp'].apply(lambda ts: (start <= ts))
0 [False, False] # 27 <= 10, 47 <= 10
2 [True, False] # 27 <= 28, 47 <= 28
4 [True, False] # 27 <= 40, 47 <= 40
6 [True, True] # 27 <= 51, 47 <= 51
Name: Timestamp, dtype: object
We then look at the second part of the lambda expression using the same logic.
>>> df1['Timestamp'].apply(lambda ts: (ts <= end))
0 [True, True] # 10 <= 33, 10 <= 53
2 [True, True] # 28 <= 33, 28 <= 53
4 [False, True] # 40 <= 33, 40 <= 53
6 [False, True] # 51 <= 33, 51 <= 53
Name: Timestamp, dtype: object
We then combine the results element-wise using the & operator.
>>> df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)))
0 [False, False] # False & True, False & True <=> (27 <= 10) & (10 <= 33), (47 <= 10) & (10 <= 53)
2 [True, False] # True & True, False & True
4 [False, False] # True & False, False & True
6 [False, True] # True & False, True & True
Name: Timestamp, dtype: object
Given that we are looking for any event from df1 that falls within any window from df2, we apply .any() to our result from above to create the boolean mask.
>>> df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())
0 False
2 True
4 False
6 True
Name: Timestamp, dtype: bool
Which results in the following selected events:
>>> df1[df1['Timestamp'].apply(lambda ts: ((start <= ts) & (ts <= end)).any())]
Energy Timestamp Channel
2 314 28 1
6 625 51 1
The timestamp of 28 from the first selected event falls within the window from the first event in df2, i.e. 30 -/+ 3.
The timestamp of 51 from the second selected event falls within the window from the other event in df2, i.e. 50 -/+ 3.
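The same mask can probably also be built in one step with numpy broadcasting instead of a Python-level lambda per row. The sketch below reuses the sample data and the names start, end and window from above; the names ts, inside, mask and selected are introduced here for illustration. Note the assumption that the intermediate array fits in memory: it holds one boolean per pair of (Channel 1 event, gated Channel 2 event), which may be too large for very long runs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Energy": [639, 798, 314, 593, 2341, 550, 625],
    "Timestamp": [10, 20, 28, 30, 40, 50, 51],
    "Channel": [1, 2, 1, 2, 1, 2, 1],
})
window = 3

df1 = df[df["Channel"].eq(1)]
df2 = df[df["Channel"].eq(2) & df["Energy"].between(300, 600)]

start = (df2["Timestamp"] - window).to_numpy()  # shape (n2,)
end = (df2["Timestamp"] + window).to_numpy()    # shape (n2,)
ts = df1["Timestamp"].to_numpy()[:, None]       # shape (n1, 1)

# Broadcasting compares every Channel 1 timestamp against every window,
# giving one row per df1 event and one column per df2 window.
inside = (start <= ts) & (ts <= end)            # shape (n1, n2)

# An event is kept if it falls inside at least one window.
mask = inside.any(axis=1)
selected = df1[mask]
```

Each row of inside corresponds to one of the boolean arrays shown in the walkthrough, and any(axis=1) plays the role of .any() inside the lambda.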
Upvotes: 1