bischoffingston

Reputation: 657

Trying to check data frequency with Pandas Series of datetime64 objects

I have some time series data that can be 1 Hz, 10 Hz, or 100 Hz. The file I load in happens to be 1 Hz:

In [6]: data = pd.read_csv("ftp.csv")

In [7]: data.Time
Out[7]: 
0             NaN
1     11:30:08 AM
2     11:30:09 AM
3     11:30:10 AM
4     11:30:11 AM
5     11:30:12 AM
6     11:30:13 AM

I convert it to datetime with:

In [8]: time = pd.to_datetime(data.Time)

In [9]: time
Out[9]: 
0                    NaT
1    2015-03-03 11:30:08
2    2015-03-03 11:30:09
3    2015-03-03 11:30:10
4    2015-03-03 11:30:11
5    2015-03-03 11:30:12

From here, how can I verify what the sampling frequency is? Do I have to do this manually, or can I use a built-in pandas method?

Upvotes: 3

Views: 4541

Answers (2)

BE-Bob

Reputation: 81

I deal in sampled acceleration data on a regular basis. Typically, the data has some sampling rate jitter (the samples are not always at equal delta-t).

Recently, I had to develop a method to determine the "average sampling rate" to verify that the data were obtained at the correct frequency. The typical Pandas methods were not particularly helpful for me and I could not find something directly on-point. This post was the closest I could find.

I refined the existing answers and added some capability; hopefully you'll find it useful.

My time data are in datetime64 format (converted from a column of strings in a DataFrame via the pd.to_datetime() function) and include the complete date and time.

import pandas as pd

Data = pd.DataFrame(
    ["2025-01-17 01:07:19.500976776",
     "2025-01-17 01:07:19.501953038",
     "2025-01-17 01:07:19.502929501",
     "2025-01-17 01:07:19.503906163",
     "2025-01-17 01:07:19.504882926",
     "2025-01-17 01:07:19.505859488",
     "2025-01-17 01:07:19.506835851"],
    columns=['Time'],
    dtype='datetime64[ns]')

I first convert the Time column to timedelta64 by subtracting the first value from every entry, which produces a Pandas Series (not a DataFrame):

rel_time = Data.Time - Data.Time.iloc[0]

rel_time
Out[42]: 
0             0 days 00:00:00
1   0 days 00:00:00.000976262
2   0 days 00:00:00.001952725
3   0 days 00:00:00.002929387
4   0 days 00:00:00.003906150
5   0 days 00:00:00.004882712
6   0 days 00:00:00.005859075
Name: Time, dtype: timedelta64[ns]

Then, I use the dt accessor of the Series to get the total seconds for each time. Note that the output type is float64 and the units are implicitly seconds:

rel_time = rel_time.dt.total_seconds()

rel_time
Out[45]: 
0    0.000000
1    0.000976
2    0.001953
3    0.002929
4    0.003906
5    0.004883
6    0.005859
Name: Time, dtype: float64

Finally, I'm ready to start investigating the frequency in Hz. I use Series.diff() and Series.describe() to gather the information, including a measure of the jitter:

dTime = rel_time.diff().describe()

dTime
Out[49]: 
count    6.000000e+00
mean     9.765125e-04
std      1.871371e-07
min      9.762620e-04
25%      9.763880e-04
50%      9.765125e-04
75%      9.766370e-04
max      9.767630e-04
Name: Time, dtype: float64

The average sample rate in Hz is:

1./dTime['mean']
Out[50]: 1024.052431484492

and the standard deviation of the sampling interval (in seconds) is:

dTime['std']
Out[51]: 1.8713711550629458e-07

Although this is off-topic, I usually have specifications for the relative SD:

dTime['std']/dTime['mean']
Out[62]: 0.0001916382181552152

The benefits of looking at sampling rate this way are:

  1. The whole data set is evaluated in one go, taking into account that jitter exists
  2. The amount (and, technically, the distribution) of jitter can be handled via statistics
  3. Using the datetime methods allows the data to span boundaries that include days and weeks without having to deal with discontinuities (see the short demo after this list)
  4. total_seconds() is an indispensable tool, because most data wranglers who use Pandas don't seem to be interested in the length of a time span in seconds, just the seconds value extracted at a given point in time (i.e. the dt.second attribute, which returns an int)
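As a quick demo of point 3 (the timestamps here are invented for the example), total_seconds() absorbs a midnight rollover with no special handling:

import pandas as pd

# Two samples two seconds apart, straddling midnight (invented timestamps)
t = pd.Series(pd.to_datetime(["2025-01-16 23:59:59", "2025-01-17 00:00:01"]))

# The day rollover is absorbed by the timedelta; no discontinuity to handle
(t - t.iloc[0]).dt.total_seconds().tolist()
# [0.0, 2.0]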

Note, finally, that I injected some extra jitter into the data to make things more interesting.
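Pulling the steps together, here is a minimal sketch that wraps the above into one reusable helper; the name estimate_sample_rate is mine, and it assumes a clean datetime64 Series with no NaT values:

import pandas as pd

def estimate_sample_rate(time: pd.Series) -> dict:
    """Estimate the mean sample rate (Hz) and jitter from a datetime64 Series."""
    # Seconds elapsed relative to the first sample (float64, implicitly seconds)
    rel = (time - time.iloc[0]).dt.total_seconds()
    # Statistics of the sample-to-sample intervals
    stats = rel.diff().describe()
    return {'rate_hz': 1. / stats['mean'],                # average sampling rate
            'jitter_s': stats['std'],                     # SD of the interval, in seconds
            'relative_sd': stats['std'] / stats['mean']}  # jitter relative to the interval

estimate_sample_rate(Data.Time)
# {'rate_hz': 1024.052..., 'jitter_s': 1.871...e-07, 'relative_sd': 0.000191...}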

Upvotes: 0

EdChum

Reputation: 394091

One method, after converting to datetime64: if the sampling rate is constant, then we can call diff() to calculate the difference between consecutive rows, which should all be the same, and compare this against an np.timedelta64 value. For your sample data this would be:

In [277]:

all(df.datetime.diff()[1:] == np.timedelta64(1, 's'))
Out[277]:
True

In [278]:

df.datetime.diff()
Out[278]:
1        NaT
2   00:00:01
3   00:00:01
4   00:00:01
5   00:00:01
6   00:00:01
Name: datetime, dtype: timedelta64[ns]
In [279]:

df.datetime.diff()[1:] == np.timedelta64(1, 's')
Out[279]:
2    True
3    True
4    True
5    True
6    True
Name: datetime, dtype: bool

To check whether the frequency is 10 Hz or 100 Hz, just change the argument to np.timedelta64: for 10 Hz use np.timedelta64(100, 'ms'), and for 100 Hz use np.timedelta64(10, 'ms').
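For instance, a sketch of the 10 Hz check, reusing the df from above:

import numpy as np

# 10 Hz means an expected gap of 100 ms between consecutive samples
all(df.datetime.diff()[1:] == np.timedelta64(100, 'ms'))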

The np.timedelta64 units can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic

Upvotes: 4
