user5503831
user5503831

Reputation: 145

Timeseries as 2 numpy arrays ('Date' and 'Data') and then extracting 'Data' from a specified 'Date' range?

I would like to ask what the best 'Date' datatype is to use for the following problem:

I am reading timeseries data from an ASCII file and creating two numpy arrays; 1) date, 2) data. Once created, I would like to extract data from a date range specified from a different ASCII file (start_time, and duration in seconds). These 'Data' would be extracted into another numpy array and written to an output file.

So far I have:

import numpy as np

    date = ['2016-07-05  12:00:00.000000',
            '2016-07-05  12:00:00.010000',
            '2016-07-05  12:00:00.020000',
            '2016-07-05  12:00:00.030000',
            '2016-07-05  12:00:00.040000',
            '2016-07-05  12:00:00.050000',
            '2016-07-05  12:00:00.060000',
            '2016-07-05  12:00:00.070000',
            '2016-07-05  12:00:00.080000',
            '2016-07-05  12:00:00.090000',
            '2016-07-05  12:00:00.100000',
            '2016-07-05  12:00:00.110000',
            '2016-07-05  12:00:00.120000',
            '2016-07-05  12:00:00.130000',
            '2016-07-05  12:00:00.140000']        
data = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]

date = np.asarray(date, dtype=np.string_)
data = np.asarray(data, dtype=np.float32)

Then in order to extract from the desired date range I have:

start_time = '2016-07-05  12:00:00.030000'
duration = 0.10 
a=[]
a=data[(date >= (start_time) & (date <= (start_time + duration))]

And then turn 'a' into an array for writing into another file:

a=np.asarray(a, dtype='float32')

However - just about every kind of dtype or datetime class I have tried for date gives me some sort of error (e.g. object has no attribute _getitem_ etc.,) or is not compatible with a list. What format would you use in this scenario? POSIX?

Thanks!

Upvotes: 1

Views: 5619

Answers (1)

jeremycg
jeremycg

Reputation: 24945

I would use pandas for this. there is really good support for time series stuff, see the docs. You would probably want to use a time series index for more detailed work, here I am using it as a normal column.

Note, your data and date have a different number of rows, I've added a 0 to the data:

import pandas as pd

date = ['2016-07-05  12:00:00.000000',
            '2016-07-05  12:00:00.010000',
            '2016-07-05  12:00:00.020000',
            '2016-07-05  12:00:00.030000',
            '2016-07-05  12:00:00.040000',
            '2016-07-05  12:00:00.050000',
            '2016-07-05  12:00:00.060000',
            '2016-07-05  12:00:00.070000',
            '2016-07-05  12:00:00.080000',
            '2016-07-05  12:00:00.090000',
            '2016-07-05  12:00:00.100000',
            '2016-07-05  12:00:00.110000',
            '2016-07-05  12:00:00.120000',
            '2016-07-05  12:00:00.130000',
            '2016-07-05  12:00:00.140000']
data = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]

x = pd.DataFrame({'date':date, 'data':data})
x.date = pd.to_datetime(x.date)

We now have a pandas dataframe with two columns, your date in datetime format, and your data.

Using the to_datetime function, we now have a datetime column, and can easily subset:

start_time = pd.to_datetime('2016-07-05  12:00:00.030000')
end_time = start_time + pd.DateOffset(seconds = 0.10) #or minutes etc etc

x[(x.date < end_time) & (x.date > start_time)]

giving:

    data    date
4   4   2016-07-05 12:00:00.040
5   5   2016-07-05 12:00:00.050
6   6   2016-07-05 12:00:00.060
7   7   2016-07-05 12:00:00.070
8   8   2016-07-05 12:00:00.080
9   9   2016-07-05 12:00:00.090
10  10  2016-07-05 12:00:00.100
11  11  2016-07-05 12:00:00.110
12  12  2016-07-05 12:00:00.120

Upvotes: 1

Related Questions