Reputation: 3842
I'm dealing with an hourly time series that runs from 2016-01-01 00:00 to 2016-01-07 23:00, with some features corresponding to each timeframe.
Ideally there would be 7 x 24 rows of data covering the whole period. But the hourly data can be interrupted for some reason (e.g. when capturing hourly weather information, the website may be broken during specific hours).
My dataframe currently contains only the valid rows, with the missing hours absent. Its length is 7 * 24 - 5, which means 5 hourly records are missing.
The datetimes are saved in the format 201601010100, representing 2016-01-01 01:00:00.
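Stamps in this format can be parsed into real timestamps with an explicit format string, for example (a minimal sketch with two made-up rows):

```python
import pandas as pd

# 201601010100 -> 2016-01-01 01:00:00, parsed with an explicit format
stamps = pd.Series([201601010100, 201601010200])
parsed = pd.to_datetime(stamps.astype(str), format="%Y%m%d%H%M")
print(parsed.iloc[0])  # 2016-01-01 01:00:00
```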
I have tried to generate a new dataframe of length 7 * 24 containing the continuous hourly datetimes:
import pandas as pd
import numpy as np

rng = pd.date_range('2016-01-01', periods=7 * 24, freq='H')
new_date = list(rng.strftime('%Y%m%d%H'))
ff = pd.DataFrame({"datetime": new_date, "feature1": np.nan, "feature2": np.nan})
The original dataframe is df, which contains the columns datetime, feature1, and feature2.
I tried f3 = ff.merge(df, on='datetime'). But f3 still did not contain all 7 * 24 hours with the features filled in.
Because in practice the temporal coverage and feature dimension are large, I don't want to loop over the two dataframes and compare their datetime values one by one. I want to fill the feature values in ff from the rows of df with identical datetimes, and keep NaN for the missing hours.
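A left join on the datetime column does exactly that: it keeps every row of the skeleton frame and leaves NaN where the original frame has no match (a minimal sketch with made-up values and only one feature column):

```python
import pandas as pd

# Skeleton with every hour; the original data is missing hour 03
ff = pd.DataFrame({"datetime": ["2016010101", "2016010102",
                                "2016010103", "2016010104"]})
df = pd.DataFrame({"datetime": ["2016010101", "2016010102", "2016010104"],
                   "feature1": [10, 20, 40]})

# how='left' keeps all rows of ff; unmatched hours get NaN features
f3 = ff.merge(df, on="datetime", how="left")
print(len(f3))                      # 4
print(f3["feature1"].isna().sum())  # 1
```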
Upvotes: 1
Views: 3070
Reputation: 22443
Missing data is pretty common when dealing with date/time-indexed data. There are no perfect solutions--the perfect solution would have been to have all the data in the first place, on exactly the cadence you wanted--but there are ways of managing missing data.
Let's start with some sample data:
import pandas as pd
import numpy as np
data = {"datetime": [201601010100, 201601010200, 201601010400, 201601010500],
        "feature1": np.random.randint(1, 100, 4),
        "feature2": np.random.rand(4)}
ff = pd.DataFrame(data)
Let's give it a real DatetimeIndex:
ff.index = pd.to_datetime(ff.datetime.astype(str))
ff.index.name = None
And finally reindex with a similar index that has all the hours:
di = pd.date_range(start=ff.index[0],
                   end=ff.index[-1],
                   freq='H')
ff = ff.reindex(di, fill_value='--missing--')
Voila! Your data is presented with the missing data points clearly marked. As a final grace note, you might decide to remove the original low-level timestamps:
del ff['datetime']
Yielding a frame in which every hour from 01:00 to 05:00 appears, with the missing 03:00 row marked --missing--.
There are still some complexities to worry about. Pandas isn't perfect, and it struggles with columns that can contain multiple types. It's fundamentally tied to NumPy arrays as its underlying storage engine, which demands homogeneous arrays. So when you reindex like this, you introduce NaN or another designated missing value into your other columns. I chose a string marker for the missing values, which looks nice, but it recasts all your columns to object dtype, which has performance implications if your data frame is large. The default alternative, inserting NaN for missing values, recasts all your numerical columns to floating point. One or the other may be more palatable to you. The floating-point route works well, e.g., if all your feature columns are floating point already. But neither is perfect.
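The dtype consequences are easy to demonstrate (a small sketch with fixed values instead of the random sample above; column names are illustrative):

```python
import pandas as pd

ff = pd.DataFrame({"feature1": [7, 9], "feature2": [0.1, 0.2]},
                  index=pd.to_datetime(["2016-01-01 01:00",
                                        "2016-01-01 03:00"]))
di = pd.date_range(ff.index[0], ff.index[-1], freq="h")

as_nan = ff.reindex(di)                            # default NaN fill
as_str = ff.reindex(di, fill_value="--missing--")  # string fill

print(as_nan["feature1"].dtype)  # float64: the int column was upcast
print(as_str["feature1"].dtype)  # object: everything was upcast
```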
Upvotes: 2
Reputation: 57033
Create an empty dataframe with rng as index:
skeleton = pd.DataFrame(index=rng)
Convert the original dates to datetime64 values so they are comparable with the time range:
df['datetime_ns'] = pd.to_datetime(df['datetime'].astype(str), format='%Y%m%d%H%M')
Perform an outer join of the frames on index and datetime_ns:
new_df = df.merge(skeleton, left_on='datetime_ns',right_index=True,how='outer')
Sort the new dataframe, if necessary:
new_df.sort_values('datetime_ns', inplace=True)
Upvotes: 1