Han Zhengzu

Reputation: 3842

Fill NaN values into continuous time series data where some timeframes are missing

An illustration of my question

I am dealing with hourly time series data containing datetimes from 2016-01-01 00:00 to 2016-01-07 23:00 and some features corresponding to each timeframe.

Ideally, there would be 7 x 24 rows of data covering the whole period. But sometimes the hourly data is interrupted for some reason (e.g. when capturing hourly weather information, the website was broken during specific hours).

My dataframe currently contains only the hours that were actually captured, with no placeholder rows for the missing hours. Its length is 7 * 24 - 5, which means 5 hourly records are missing.

The datetimes are saved in the format 201601010100, representing 2016-01-01 01:00:00.
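
For reference, a single stamp in that format can be parsed with an explicit format string (just a minimal illustration, not part of my actual code):

import pandas as pd

# '%Y%m%d%H%M' matches stamps like 201601010100
ts = pd.to_datetime("201601010100", format="%Y%m%d%H%M")
print(ts)  # 2016-01-01 01:00:00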

My attempt.

I have tried to generate a new dataframe of length 7 * 24 containing the continuous hourly datetimes:

import pandas as pd

data = {"datetime": [], "feature1": [], "feature2": []}
ff = pd.DataFrame(data)
# one row per hour for the 7-day period (7 * 24 = 168 hours)
rng = pd.date_range('01/01/2016', periods=7*24, freq='H')
# the format string must match how datetime is stored in df (YYYYMMDDHHMM)
new_date = list(rng.strftime('%Y%m%d%H%M'))
ff['datetime'] = new_date

The original dataframe is df, which contains datetime, feature1, and feature2.

I tried f3 = ff.merge(df, on='datetime').

But f3 still had a length of 7 * 24.

My target

Because the temporal coverage and the feature dimension are large in practice, I don't want to loop over the two dataframes and compare their datetime values one by one.

I want to fill in the feature values of ff from the rows of df with identical datetimes, and keep NaN for the hours that are missing from df.
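
To make the goal concrete, here is a toy version of the desired result (the feature values are made up; the real data has many more rows and columns):

import pandas as pd

# toy skeleton of every hour, and a df where hour 02:00 is missing
ff = pd.DataFrame({"datetime": ["201601010000", "201601010100",
                                "201601010200", "201601010300"]})
df = pd.DataFrame({"datetime": ["201601010000", "201601010100", "201601010300"],
                   "feature1": [10, 20, 30],
                   "feature2": [0.1, 0.2, 0.3]})

# desired: every hour of ff kept, NaN where df has no data
desired = ff.merge(df, on="datetime", how="left")
print(desired)
#        datetime  feature1  feature2
# 0  201601010000      10.0       0.1
# 1  201601010100      20.0       0.2
# 2  201601010200       NaN       NaN
# 3  201601010300      30.0       0.3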

Upvotes: 1

Views: 3070

Answers (2)

Jonathan Eunice

Reputation: 22443

Missing data is pretty common when dealing with date/time-indexed data. There are no perfect solutions--the perfect solution would have been to have had all the data in the first place, on exactly the cadence you wanted--but there are ways of managing missing data.

Let's start with some sample data:

import pandas as pd
import numpy as np

data = {"datetime": [201601010100,201601010200,201601010400,201601010500],
        "feature1": np.random.randint(1,100,4),
        "feature2": np.random.rand(4) }
ff = pd.DataFrame(data)


Let's give it a real DatetimeIndex:

# parse the YYYYMMDDHHMM integers into a proper DatetimeIndex
ff.index = pd.to_datetime(ff.datetime.astype(str), format='%Y%m%d%H%M')
ff.index.name = None


And finally, reindex with a similar index that has all the hours:

# pd.date_range builds the complete hourly index over the same span
di = pd.date_range(start=ff.index[0],
                   end=ff.index[-1],
                   freq='H')
ff = ff.reindex(di, fill_value='--missing--')


Voila! Your data is presented with the missing data points clearly marked. As a final grace note, you might decide to remove the original low-level timestamps:

del ff['datetime']

Yielding a frame with just the feature columns, indexed purely by the hourly timestamps.

There are still some complexities to worry about. Pandas isn't perfect, and it struggles to deal with columns that can contain multiple types. It's fundamentally tied to NumPy arrays as its underlying storage engine, which demands homogeneous arrays. So when you reindex like this, you introduce NaN or another designated missing value into your other columns.

I chose a string value for the missing entries, which looks nice but recasts all your columns to object dtype, which has performance implications if your data frame is large. The default alternative, inserting NaN for the missing values, will recast all your numerical columns to floating point. One or the other may be more palatable to you. The floating-point option works well, e.g., if all your feature columns are floating-point values already. But neither is perfect.
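
A tiny sketch of that dtype trade-off (a hypothetical one-column frame, separate from the example above):

import pandas as pd

# two hourly observations with 01:00 missing in between
small = pd.DataFrame({"feature1": [1, 2]},
                     index=pd.to_datetime(["2016-01-01 00:00",
                                           "2016-01-01 02:00"]))
full = pd.date_range(small.index[0], small.index[-1], freq='H')

# NaN fill: the integer column is upcast to float64
print(small.reindex(full).dtypes)                            # feature1: float64

# string fill: the column becomes object dtype
print(small.reindex(full, fill_value='--missing--').dtypes)  # feature1: object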

Upvotes: 2

DYZ

Reputation: 57033

Create an empty dataframe with rng as index:

skeleton = pd.DataFrame(index=rng)

Convert the original dates to datetime64 to make them compatible with the time range:

# pd.to_datetime with an explicit format handles the YYYYMMDDHHMM stamps
df['datetime_ns'] = pd.to_datetime(df['datetime'].astype(str), format='%Y%m%d%H%M')

Perform an outer join of the frames on index and datetime_ns:

# an outer join keeps every hour from the skeleton, with NaN where df has no data
new_df = df.merge(skeleton, left_on='datetime_ns', right_index=True, how='outer')

Sort the new dataframe, if necessary:

new_df.sort_values('datetime_ns', inplace=True)
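
As a quick sanity check (assuming feature1 itself has no NaN values in df), the rows contributed only by the skeleton are exactly the missing hours:

# hours that exist only in the skeleton carry NaN in df's columns
missing_hours = new_df[new_df['feature1'].isna()]
print(len(missing_hours))   # 5 in the question's 7*24 - 5 scenario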

Upvotes: 1
