dobbe

Reputation: 29

Joining time series data from multiple sources, subsetted by the least comprehensive dataset

I'm creating a script in Python which uses 'n' CSVs of univariate data. Each CSV holds the readings from one sensor and the time (to the nearest minute) each reading was taken. I want to merge all these CSVs into one CSV that indexes all the sensor data by shared timestamps, e.g. below.

Ideally all 'n' sensors would collect data over the same time period; however, there will be some slight lag between the first and last sensors connecting. The way I envision this code working is that each CSV is subsetted using the latest common start time and the earliest common finish time.
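A minimal sketch of that trimming step, assuming the CSVs have already been loaded into DataFrames indexed by time (the sensor names and values below are invented for illustration):

```python
import pandas as pd

# Two hypothetical sensors with slightly offset start/finish times
df_a = pd.DataFrame({'reading': [90, 92, 93]},
                    index=pd.to_datetime(['11:04', '11:05', '11:06']))
df_b = pd.DataFrame({'reading': [93, 95, 92]},
                    index=pd.to_datetime(['11:05', '11:06', '11:07']))

frames = {'sensor_1': df_a, 'sensor_2': df_b}

# Latest common start and earliest common finish across all sensors
start = max(df.index.min() for df in frames.values())
end = min(df.index.max() for df in frames.values())

# Trim every frame to the shared window (label-based slicing is inclusive)
trimmed = {name: df.loc[start:end] for name, df in frames.items()}
```

After trimming, every frame covers the same window, so a subsequent concat or join has a common index range to work with.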

Currently, I'm looping through the local directory and storing my CSVs as pandas DataFrames inside a dictionary, indexing them by their 'time' column and using pd.concat to join them. This works as long as the indexes are identical; however, I'm trying to make the script more robust to minor variations between the datasets.


Sensor 1           
Time  Reading
11:05  90
11:05  92
11:05  93
11:06  90

Sensor 2
Time  Reading
11:05  93
11:05  95 
11:06  93
11:07  92

Joined csv
Time  Sensor1  Sensor 2
11:05  92       93
11:05  93       95
11:06  90       93
# current code concatenates CSVs but can't deal with minor variations in time indexes
import glob

import pandas as pd

frames = {}
sensors = ['time']

# glob.glob already returns a list; no need to rebuild it inside a loop
files = glob.glob(file)

for i in range(len(files)):
    frames["sensor_{0}".format(i + 1)] = pd.read_csv(files[i], usecols=[1, 2])

frames = { k: v.set_index('time') for k, v in frames.items()}

dataframe = pd.concat(frames, axis=1) 

dataframe.to_csv(path + "\\concat.csv", index=True, header=False)

# header=None is needed on re-read: the file was written without a header,
# so the default would consume the first data row as column names
dataframe = pd.read_csv(path + "\\concat.csv", header=None)

for i in range(len(files)):
    sensors.append('sensor '+ str(1+i))

dataframe.columns = sensors

dataframe['time'] = pd.to_datetime(dataframe['time'])
dataframe = dataframe.set_index('time')

TL;DR: is there a way to create a joined CSV of timestamped data, subsetted by common timestamp values across n CSVs, in Python?

Upvotes: 1

Views: 588

Answers (1)

Itamar Mushkin

Reputation: 2905

For example, let's look at two dataframes, similar to what you've posted, but with one datapoint shifted by a few seconds (I'll take that as an example of a "minor variation").

import pandas as pd

df1 = pd.DataFrame(
    data=[('11:05', 90),
          ('11:05', 92),
          ('11:05', 93),
          ('11:06', 90)],
    columns=('Time', 'Sensor1'))

df2 = pd.DataFrame(
    data=[('11:05', 93),
          ('11:05:03', 95),
          ('11:06', 93),
          ('11:07', 92)],
    columns=('Time', 'Sensor2'))

First of all, we'll change the "Time" column from a text format to a pd.Timestamp format:

df1['Time'] = df1['Time'].apply(pd.Timestamp)
df2['Time'] = df2['Time'].apply(pd.Timestamp)

This will add the current date of today as the date, but that's not important for this example.

To deal with "minor variations", we can .resample the time series to align it to a time resolution of one minute (if we want another resolution, see the docs of .resample).

df1 = df1.set_index('Time').resample('1min').mean()  # mean is just one choice; median or another aggregation also works
df2 = df2.set_index('Time').resample('1min').mean()

Finally, we .join the two dataframes on their index (which we set to be time), keeping only the intersection of their indexes:

df1.join(df2, how='inner')

And that's it!
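The same idea extends to the n-CSV case from the question: resample each frame, then let pd.concat with join='inner' keep only the timestamps present in every frame. A sketch with invented sensor names and values:

```python
import pandas as pd

# Hypothetical frames, as they might sit in the question's dictionary
frames = {
    'Sensor1': pd.DataFrame({'Time': pd.to_datetime(['11:05', '11:05', '11:06']),
                             'Sensor1': [92, 93, 90]}),
    'Sensor2': pd.DataFrame({'Time': pd.to_datetime(['11:05', '11:06', '11:07']),
                             'Sensor2': [95, 93, 92]}),
}

# Resample each frame to one-minute bins, averaging duplicate readings
resampled = [df.set_index('Time').resample('1min').mean()
             for df in frames.values()]

# join='inner' keeps only timestamps shared by all frames
combined = pd.concat(resampled, axis=1, join='inner')
combined.to_csv('combined.csv')
```

Because the resampling averages the duplicate 11:05 readings of Sensor1, the joined result has one row per shared minute (11:05 and 11:06 here), matching the subsetting the question asks for.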

Upvotes: 1
