Reputation: 29
I'm creating a script in Python which uses 'n' CSVs of univariate data. Each CSV has the readings from a sensor and the time (to the nearest minute) at which each reading was taken. I want to merge all these CSVs into one CSV that indexes all the sensor data by shared timestamps, e.g. below.
Ideally all 'n' sensors would be collecting data over the same time period; however, there will be some slight lag between the first and last sensors connected. The way I envision this code working is that a subset of each CSV will be derived using the latest common start time and the earliest common finish time.
Currently, I'm looping through the local directory and storing my CSVs as pandas DataFrames inside a dictionary, indexing them by their 'time' column and using pd.concat to join them. This works so long as the indexes are identical; however, I'm trying to make the script more robust to minor variations between the datasets.
Sensor 1
Time   Reading
11:05  90
11:05  92
11:05  93
11:06  90

Sensor 2
Time   Reading
11:05  93
11:05  95
11:06  93
11:07  92

Joined CSV
Time   Sensor 1  Sensor 2
11:05  92        93
11:05  93        95
11:06  90        93
# current code concatenates the CSVs but can't deal with minor variations in the time indexes
import glob
import pandas as pd

frames = {}
sensors = ['time']
files = glob.glob(file)  # 'file' is the glob pattern matching the sensor CSVs
for i in range(len(files)):
    frames["sensor_{0}".format(i + 1)] = pd.read_csv(files[i], usecols=[1, 2])
# index each frame by its 'time' column, then join them side by side
frames = {k: v.set_index('time') for k, v in frames.items()}
dataframe = pd.concat(frames, axis=1)
dataframe.to_csv(path + "\\concat.csv", index=True, header=False)
dataframe = pd.read_csv(path + "\\concat.csv", header=None)  # header=None so the first data row isn't consumed as a header
# rebuild the column names as 'time', 'sensor 1', ..., 'sensor n'
for i in range(len(files)):
    sensors.append('sensor ' + str(1 + i))
dataframe.columns = sensors
dataframe['time'] = pd.to_datetime(dataframe['time'])
dataframe = dataframe.set_index('time')
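Something like the sketch below is what I have in mind for the subsetting (untested; it assumes each frame's time index is sorted), slotting in just before the pd.concat step:

# rough sketch: clip each time-indexed frame to the shared window before joining
start = max(df.index.min() for df in frames.values())  # latest common start
end = min(df.index.max() for df in frames.values())    # earliest common finish
frames = {name: df.loc[start:end] for name, df in frames.items()}
dataframe = pd.concat(frames, axis=1)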
TL;DR: is there a way to create a joined CSV of timestamped data, subsetted by the common timestamp values across 'n' CSVs, in Python?
Upvotes: 1
Views: 588
Reputation: 2905
For example, let's look at two dataframes, similar to what you've posted, but with one datapoint shifted by a few seconds (I'll take that as an example of a "minor variation").
import pandas as pd

df1 = pd.DataFrame(
    data=[('11:05', 90),
          ('11:05', 92),
          ('11:05', 93),
          ('11:06', 90)],
    columns=('Time', 'Sensor1'))

df2 = pd.DataFrame(
    data=[('11:05', 93),
          ('11:05:03', 95),
          ('11:06', 93),
          ('11:07', 92)],
    columns=('Time', 'Sensor2'))
First of all, we'll change the "Time" column from a text format to a pd.Timestamp format:
df1['Time'] = df1['Time'].apply(pd.Timestamp)
df2['Time'] = df2['Time'].apply(pd.Timestamp)
This will use today's date as the date part, but that's not important for this example.
To deal with "minor variations", we can .resample the time series to align it to a time resolution of one minute (for other resolutions, see the docs of .resample):
df1 = df1.set_index('Time').resample('1T').mean() # mean - just for example; can also be median or some other function
df2 = df2.set_index('Time').resample('1T').mean()
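The same idea works at other resolutions too; for instance (just an illustrative extra, df1_5min is a name I'm introducing here, not part of the solution):

# e.g. aggregate further into coarser 5-minute bins (df1 is already time-indexed at this point)
df1_5min = df1.resample('5T').mean()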
Finally, we .join the two dataframes based on their index (which we set to be the time), keeping only the intersection of the indexes (how='inner'):
df1.join(df2, how='inner')
And that's it!
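To generalize this to 'n' sensor CSVs as in your question, the same steps can be looped over all the files. A rough sketch is below; the 'sensor_*.csv' pattern and the 'Time'/'Reading' column names are assumptions based on your examples, so adjust them to your files:

import glob
import pandas as pd

frames = []
# hypothetical file pattern and column names -- adjust to the real CSVs
for i, fname in enumerate(sorted(glob.glob('sensor_*.csv')), start=1):
    df = pd.read_csv(fname)
    df['Time'] = df['Time'].apply(pd.Timestamp)
    df = df.set_index('Time').resample('1T').mean()
    frames.append(df.rename(columns={'Reading': 'Sensor{}'.format(i)}))

# join='inner' keeps only the timestamps common to every resampled frame
joined = pd.concat(frames, axis=1, join='inner')
joined.to_csv('concat.csv')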
Upvotes: 1