Matteo Brini

Reputation: 183

Optimize file reading with numpy

I have a .dat file produced by an FPGA. The file contains 3 columns: the first is the input channel (it can be 1 or 2), the second is the timestamp at which an event occurred, and the third is the local time at which the same event occurred. The third column is necessary because sometimes the FPGA has to reset its clock counter, so the timestamp column does not increase continuously. An example of what I mean is shown in the figure below.

[Figure: sawtooth trend of the FPGA timestamp column]

An example of some lines from the .dat file is the following:

1   80.80051152 2022-02-24T18:28:49.602000
2   80.91821978 2022-02-24T18:28:49.716000
1   80.94284154 2022-02-24T18:28:49.732000
2   0.01856876  2022-02-24T18:29:15.068000
2   0.04225772  2022-02-24T18:29:15.100000
2   0.11766780  2022-02-24T18:29:15.178000

The time column is given by the FPGA (with a resolution of tens of nanoseconds); the date column is written by the Python script that listens to the FPGA: whenever it writes a timestamp, it also saves the local time as a date.

I am interested in getting two arrays (one for each channel) where, for each event, I have the time at which that event occurred relative to the starting time of the acquisition. This is how the data given above should look at the end:

8.091821978000000115e+01
1.062702197800000050e+02
1.062939087400000062e+02
1.063693188200000179e+02

These data refer to the second channel only. A double check can be made by looking at the third column of the data above.
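For instance, the second value can be checked by hand: the clock reset happens between the first and the second channel-2 event, so 80.91821978 + (18:29:15.068 - 18:28:49.716) = 80.91821978 + 25.352 = 106.27021978 s, which matches 1.062702197800000050e+02 above.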

I tried to achieve this with a function (too messy for my taste) where I check, for every pair of consecutive events, whether the difference in the FPGA timestamps deviates by more than 1 second from the difference in local time; if that is the case, I evaluate the elapsed time through the local-time column and correct the timestamps by the right amount:

import numpy as np

# Read channel, FPGA timestamp and local date from the .dat file
ch, time, date = np.genfromtxt("events220302_1d.dat", unpack=True,
                               dtype=(int, float, 'datetime64[ms]'))

# Split the events by input channel
mask1 = ch == 1
mask2 = ch == 2

time1 = time[mask1]
time2 = time[mask2]
date1 = date[mask1]
date2 = date[mask2]

corr1 = np.zeros(len(time1))

for idx, val in enumerate(time1):
    if idx < len(time1) - 1:
        if check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1]) == 0:
            # Clock reset detected: rebuild the step from the local-time column
            corr1[idx+1] = val + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1]
# Every correction also applies to all later events
time1 = time1 + corr1.cumsum()

where check_dif is a function that returns 0 if the difference in time between consecutive events is inconsistent with the difference in date between the same two events, as described above.
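check_dif itself is not shown; a minimal sketch consistent with this description (the 1-second threshold is the one mentioned above; the exact signature and return values are assumptions) could be:

def check_dif(t0, t1, d0, d1):
    # Return 0 when the FPGA time step is inconsistent with the
    # wall-clock step (i.e. a clock reset happened), 1 otherwise
    dt_date = (d1 - d0) / np.timedelta64(1, 's')
    return 0 if abs(dt_date - (t1 - t0)) > 1.0 else 1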

Is there any more elegant, or even faster, way to get what I want, maybe with some fancy NumPy coding?

Upvotes: 1

Views: 95

Answers (1)

Alexiei

Reputation: 31

A simple initial way to optimize your code is to make it if-less, thus getting rid of both if statements. To do so, instead of returning 0 from check_dif, return 1 when "the difference in time between consecutive events is inconsistent with the difference in date between the same two events", and 0 otherwise, so the return value can be used directly as a multiplier; see the sketch below.
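For illustration, the flipped helper could look like this (a sketch; the 1-second threshold comes from the question, the exact form is an assumption):

def check_dif(t0, t1, d0, d1):
    # Return 1 when the FPGA time step disagrees with the wall-clock
    # step by more than 1 second (a clock reset), otherwise 0
    dt_date = (d1 - d0) / np.timedelta64(1, 's')
    return int(abs(dt_date - (t1 - t0)) > 1.0)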

Your for loop will then become something like this:

for idx in range(len(time1) - 1):
    is_dif = check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1])
    # Correction value: if is_dif == 0 the whole term vanishes (no correction);
    # otherwise the step is rebuilt from the date column
    corr1[idx+1] = is_dif * (time1[idx] + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1])

A more NumPy-ish way to do things would be full vectorization, as sketched below. I don't know whether you have a speed benchmark or how big the file is, but I think in your case the change above should be good enough.
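For completeness, a vectorized sketch that replaces the loop entirely (assuming, as above, that a reset shows up as the FPGA step and the wall-clock step disagreeing by more than 1 second) could be:

dt_fpga = np.diff(time1)                           # FPGA time steps
dt_date = np.diff(date1) / np.timedelta64(1, 's')  # wall-clock steps in seconds
reset = np.abs(dt_date - dt_fpga) > 1.0            # True where a reset happened

corr1 = np.zeros_like(time1)
# Rebuild the step from the date column wherever a reset was detected
corr1[1:] = np.where(reset, time1[:-1] + dt_date - time1[1:], 0.0)
time1 = time1 + corr1.cumsum()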

Upvotes: 3
