Reputation: 421
I am reading many CSV files. Each one contains time series data. For example:
import pandas as pd
csv_a = [['2019-05-25 10:00', 25, 60],
['2019-05-25 10:05', 26, 25],
['2019-05-25 10:10', 27, 63],
['2019-05-25 10:20', 28, 62]]
df_a = pd.DataFrame(csv_a, columns=["Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = (pd.to_datetime(df_a["Timestamp"]))
csv_b = [['2019-05-25 10:05', 1020],
['2019-05-25 10:10', 1021],
['2019-05-25 10:15', 1019],
['2019-05-25 10:45', 1035]]
df_b = pd.DataFrame(csv_b, columns=["Timestamp", "Pressure"])
df_b["Timestamp"] = (pd.to_datetime(df_b["Timestamp"]))
After creating these Dataframes, we can see:
print(df_a)
Timestamp Temperature Humidity
0 2019-05-25 10:00:00 25 60
1 2019-05-25 10:05:00 26 25
2 2019-05-25 10:10:00 27 63
3 2019-05-25 10:20:00 28 62
print(df_b)
Timestamp Pressure
0 2019-05-25 10:05:00 1020
1 2019-05-25 10:10:00 1021
2 2019-05-25 10:15:00 1019
3 2019-05-25 10:45:00 1035
I want to create a new Dataframe with a regular index, for example:
import datetime as dt
start = dt.datetime(2019,5,25,10,0,0)
end = dt.datetime(2019,5,25,10,20,0)
index = pd.date_range(start, end, freq='5min')
And then, start appending each time series in different columns, filling the missing values with NaN and discarting values out of my index.
Desired output:
Temperature Humidity Pressure
Timestamp
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
And I also want do this as efficient as possible. Let's say I have hundreds of CSVs and long series.
I am messing with Panda's functions like concat or append, but I am not able to obtain what I want.
Upvotes: 1
Views: 496
Reputation: 30920
Use DataFrame.merge
. You could check with Series.diff
to discard rows where there is a temporary jump greater than the period.
But you could choose another criteria to exclude rows (let me know if you want to choose another criteria)
df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
.sort_values('Timestamp'))
diff=df2['Timestamp'].diff().abs().bfill()
mask=diff.eq(diff.min())
new_df=(df2.loc[mask]
.set_index('Timestamp')
)
print(new_df)
# Temperature Humidity Pressure
#Timestamp
#2019-05-25 10:00:00 25.0 60.0 NaN
#2019-05-25 10:05:00 26.0 25.0 1020.0
#2019-05-25 10:10:00 27.0 63.0 1021.0
#2019-05-25 10:15:00 NaN NaN 1019.0
#2019-05-25 10:20:00 28.0 62.0 NaN
You could select the frecuency and rule out those who don't comply by doing
df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
.set_index('Timestamp')
)
new_df=(df2.reindex(pd.date_range(df2.index.min(),df2.index.max(),freq='5min'))
.loc[lambda x: x.isna().all(axis=1).cumsum().eq(0)])
or simply enter the lower and upper limit statically as you say in your question
Upvotes: 2
Reputation: 25239
As I understand you already had a custom datetimeindex index
and want to join each time series by this index
. Try combine_first
and reindex
. If you have multiple time series to join, you need to use loop or use python reduce
df_out = df_b.combine_first(df_a).reindex(index)
Out[1063]:
Humidity Pressure Temperature
2019-05-25 10:00:00 60.0 NaN 25.0
2019-05-25 10:05:00 25.0 1020.0 26.0
2019-05-25 10:10:00 63.0 1021.0 27.0
2019-05-25 10:15:00 NaN 1019.0 NaN
2019-05-25 10:20:00 62.0 NaN 28.0
If your time series don't have the same column names, you may try join
. Just list the time series inside the bracket []
df_out = df_a.join([df_b], how='outer').reindex(index)
Out[1068]:
Temperature Humidity Pressure
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
Upvotes: 3
Reputation: 2222
Did you try pd.merge
?
pd.merge(df_a, df_b, how='outer').set_index('Timestamp').sort_index()
output:
Temperature Humidity Pressure
Timestamp
2019-05-25 10:00:00 25.0 60.0 NaN
2019-05-25 10:05:00 26.0 25.0 1020.0
2019-05-25 10:10:00 27.0 63.0 1021.0
2019-05-25 10:15:00 NaN NaN 1019.0
2019-05-25 10:20:00 28.0 62.0 NaN
2019-05-25 10:45:00 NaN NaN 1035.0
Upvotes: 1