Reputation: 512
I have timeseries data from three different sensors over the period of a year. The sensors produce a data point roughly every 3 minutes, and they are not synchronized, so they output data points at different times relative to each other.
The data is in an SQLite database, in one table of approximately half a million records. I intend to display the data using the JavaScript chart library dygraphs. I have already produced a timeseries chart for each sensor individually by running an SQL query filtered on the sensor name and saving the result to CSV. I now want a single chart that displays all the data points, with one line per sensor.
I have created a 2D numpy array of strings called 'minutes_array'. Its first column holds unix timestamps rounded to the nearest minute, covering every minute from the start of the sensor timeseries to the end; the remaining three columns are empty and are to be filled with data from each of the 3 sensors where available.
minutes_array
[['1316275620' '' '' '']
['1316275680' '' '' '']
['1316275740' '' '' '']
...,
['1343206920' '' '' '']
['1343206980' '' '' '']
['1343207040' '' '' '']]
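For reference, a minimal sketch of how such an array can be built, assuming the first and last timestamps shown in the dump above and a fixed-width string dtype:
import numpy as np

start, end = 1316275620, 1343207040          # first and last minute from the dump above
timestamps = np.arange(start, end + 60, 60)  # one unix timestamp per minute

# Column 0 holds the timestamp as a string; columns 1-3 are empty slots for the three sensors.
minutes_array = np.full((timestamps.size, 4), '', dtype='U20')
minutes_array[:, 0] = timestamps.astype(str)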
The sensor timeseries data is also rounded to the nearest minute. I then use numpy.in1d on the timestamps from the above 'minutes_array' and the 'sensor_data' array to create a mask of the records relating to that sensor.
sensor_data
[['1316275680' '215.2']
['1316275860' '227.0']
['1316276280' '212.2']
...,
['1343206380' '187.7']
['1343206620' '189.4']
['1343206980' '192.9']]
mask = np.in1d(minutes_array[:,0], sensor_data[:,0])
[False True False ..., False True False]
I then wish to modify the records in minutes_array which are True for that mask, placing each sensor_data value into the first column after the timestamp in minutes_array. From my attempts it does not seem possible to alter the original 'minutes_array' when a mask is applied to it. Is there a way to achieve this in numpy without using for loops and matching the timestamps individually?
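For reference, boolean-mask assignment does modify the original array as long as the mask and the column index are applied in a single indexing step (chained indexing such as minutes_array[mask][:, 1] = ... writes to a temporary copy). A minimal sketch, assuming both arrays are sorted by timestamp and contain no duplicate minutes:
import numpy as np

# Mask of minutes that have a reading, and mask of readings whose minute is listed.
mask = np.in1d(minutes_array[:, 0], sensor_data[:, 0])
rev_mask = np.in1d(sensor_data[:, 0], minutes_array[:, 0])

# Single-step boolean indexing assigns in place; because both arrays are sorted
# and duplicate-free, the selected rows line up one-to-one.
minutes_array[mask, 1] = sensor_data[rev_mask, 1]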
Based on the answer below from @eumiro, I used a solution from the pandas docs together with the 'sensor_data' numpy array described above:
import pandas as pd

sensors = ['s1', 's2', 's3']
sensor_results = {}
for sensor in sensors:
    # get_array returns the two-column [timestamp, value] array for one sensor.
    sensor_data = get_array(db_cursor, sensor)
    sensor_results[sensor] = pd.Series(sensor_data[:, 1],
                                       index=sensor_data[:, 0])
df = pd.DataFrame(sensor_results)
df.to_csv("output.csv")
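As an optional refinement (not part of the original solution), the unix-second index can be converted to datetimes before writing, so the first CSV column contains readable dates:
import pandas as pd

df = pd.DataFrame(sensor_results)
# The index currently holds unix seconds as strings; turn it into datetimes.
df.index = pd.to_datetime([int(t) for t in df.index], unit='s')
df.to_csv("output.csv")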
Upvotes: 3
Views: 4175
Reputation: 212835
Half a million records is not too many to manage with a Python dictionary.
Read the data for all sensors from the database, fill a dictionary, and then build a numpy array from it, or, even better, convert it to a pandas.DataFrame:
import pandas as pd

# Example readings for three sensors: (timestamp, value) pairs.
inp1 = [(1316275620, 1), (1316275680, 2)]
inp2 = [(1316275620, 10), (1316275740, 20)]
inp3 = [(1316275680, 100), (1316275740, 200)]
inps = [('s1', inp1), ('s2', inp2), ('s3', inp3)]

# Build a nested dict: {sensor name: {timestamp: value}}.
data = {}
for name, inp in inps:
    d = data.setdefault(name, {})
    for timestamp, value in inp:
        d[timestamp] = value

# The DataFrame aligns all sensors on the union of timestamps, filling gaps with NaN.
df = pd.DataFrame.from_dict(data)
df is now:
             s1   s2   s3
1316275620    1   10  NaN
1316275680    2  NaN  100
1316275740  NaN   20  200
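A hedged sketch of the database step described above, assuming a table named readings with columns sensor, timestamp and value (the table, column, and file names are guesses; adjust them to the real schema):
import sqlite3
import pandas as pd

conn = sqlite3.connect("sensors.db")  # hypothetical database file
cur = conn.cursor()

data = {}
for sensor in ('s1', 's2', 's3'):
    cur.execute("SELECT timestamp, value FROM readings WHERE sensor = ?", (sensor,))
    d = data.setdefault(sensor, {})
    for timestamp, value in cur.fetchall():
        # Round each unix timestamp to the nearest minute, as in the question.
        d[int(round(timestamp / 60.0)) * 60] = value

df = pd.DataFrame.from_dict(data)
df.to_csv("output.csv")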
Upvotes: 4