eliteA92
eliteA92

Reputation: 421

Best way of joining many time series in pandas to a single datetime index?

I am reading many CSV files. Each one contains time series data. For example:

import pandas as pd

csv_a = [['2019-05-25 10:00', 25, 60],
         ['2019-05-25 10:05', 26, 25],
         ['2019-05-25 10:10', 27, 63],
         ['2019-05-25 10:20', 28, 62]]
df_a = pd.DataFrame(csv_a, columns=["Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = (pd.to_datetime(df_a["Timestamp"]))

csv_b = [['2019-05-25 10:05', 1020],
         ['2019-05-25 10:10', 1021],
         ['2019-05-25 10:15', 1019],
         ['2019-05-25 10:45', 1035]]
df_b = pd.DataFrame(csv_b, columns=["Timestamp", "Pressure"])
df_b["Timestamp"] = (pd.to_datetime(df_b["Timestamp"]))

After creating these Dataframes, we can see:

print(df_a)
            Timestamp  Temperature  Humidity
0 2019-05-25 10:00:00           25        60
1 2019-05-25 10:05:00           26        25
2 2019-05-25 10:10:00           27        63
3 2019-05-25 10:20:00           28        62

print(df_b)
            Timestamp  Pressure
0 2019-05-25 10:05:00      1020
1 2019-05-25 10:10:00      1021
2 2019-05-25 10:15:00      1019
3 2019-05-25 10:45:00      1035

I want to create a new Dataframe with a regular index, for example:

import datetime as dt

start = dt.datetime(2019,5,25,10,0,0)
end = dt.datetime(2019,5,25,10,20,0)
index = pd.date_range(start, end, freq='5min')

And then, start appending each time series in different columns, filling the missing values with NaN and discarting values out of my index.

Desired output:

                     Temperature  Humidity  Pressure
Timestamp                                           
2019-05-25 10:00:00         25.0      60.0       NaN
2019-05-25 10:05:00         26.0      25.0    1020.0
2019-05-25 10:10:00         27.0      63.0    1021.0
2019-05-25 10:15:00          NaN       NaN    1019.0
2019-05-25 10:20:00         28.0      62.0       NaN

And I also want do this as efficient as possible. Let's say I have hundreds of CSVs and long series.

I am messing with Panda's functions like concat or append, but I am not able to obtain what I want.

Upvotes: 1

Views: 496

Answers (3)

ansev
ansev

Reputation: 30920

Use DataFrame.merge. You could check with Series.diff to discard rows where there is a temporary jump greater than the period. But you could choose another criteria to exclude rows (let me know if you want to choose another criteria)

df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
           .sort_values('Timestamp'))
diff=df2['Timestamp'].diff().abs().bfill()
mask=diff.eq(diff.min())
new_df=(df2.loc[mask]
           .set_index('Timestamp')
       )
print(new_df)


#                     Temperature  Humidity  Pressure
#Timestamp                                           
#2019-05-25 10:00:00         25.0      60.0       NaN
#2019-05-25 10:05:00         26.0      25.0    1020.0
#2019-05-25 10:10:00         27.0      63.0    1021.0
#2019-05-25 10:15:00          NaN       NaN    1019.0
#2019-05-25 10:20:00         28.0      62.0       NaN

You could select the frecuency and rule out those who don't comply by doing

df2 = (df_a.merge(df_b, on='Timestamp', how='outer')
           .set_index('Timestamp')
        )

new_df=(df2.reindex(pd.date_range(df2.index.min(),df2.index.max(),freq='5min'))
           .loc[lambda x: x.isna().all(axis=1).cumsum().eq(0)])

or simply enter the lower and upper limit statically as you say in your question

Upvotes: 2

Andy L.
Andy L.

Reputation: 25239

As I understand you already had a custom datetimeindex index and want to join each time series by this index. Try combine_first and reindex. If you have multiple time series to join, you need to use loop or use python reduce

df_out = df_b.combine_first(df_a).reindex(index)

Out[1063]:
                     Humidity  Pressure  Temperature
2019-05-25 10:00:00      60.0       NaN         25.0
2019-05-25 10:05:00      25.0    1020.0         26.0
2019-05-25 10:10:00      63.0    1021.0         27.0
2019-05-25 10:15:00       NaN    1019.0          NaN
2019-05-25 10:20:00      62.0       NaN         28.0

If your time series don't have the same column names, you may try join. Just list the time series inside the bracket []

df_out = df_a.join([df_b], how='outer').reindex(index)

Out[1068]:
                     Temperature  Humidity  Pressure
2019-05-25 10:00:00         25.0      60.0       NaN
2019-05-25 10:05:00         26.0      25.0    1020.0
2019-05-25 10:10:00         27.0      63.0    1021.0
2019-05-25 10:15:00          NaN       NaN    1019.0
2019-05-25 10:20:00         28.0      62.0       NaN

Upvotes: 3

davidbilla
davidbilla

Reputation: 2222

Did you try pd.merge?

pd.merge(df_a, df_b, how='outer').set_index('Timestamp').sort_index()

output:

                     Temperature  Humidity  Pressure
Timestamp                                           
2019-05-25 10:00:00         25.0      60.0       NaN
2019-05-25 10:05:00         26.0      25.0    1020.0
2019-05-25 10:10:00         27.0      63.0    1021.0
2019-05-25 10:15:00          NaN       NaN    1019.0
2019-05-25 10:20:00         28.0      62.0       NaN
2019-05-25 10:45:00          NaN       NaN    1035.0

Upvotes: 1

Related Questions