adamcatto

Reputation: 143

Imputing missing timestamp indices in Pandas?

Here is a snapshot of a DataFrame df I'm working with:

2014-02-01 09:58:03  1.576119   0.0   8.355      0.0    0.0        1.0    0.0
2014-02-01 09:58:33  1.576119   0.0  13.371      0.0    0.0        1.0    0.0
2014-02-01 09:59:03  1.576119   0.0  13.833      0.0    0.0        1.0    0.0

The Timestamp indices are spaced 30 seconds apart. I'm trying to concatenate a number of rows populated with np.nan values while continuing the pattern of 30-second-separated Timestamp indices, i.e. something that would look like:

2014-02-01 09:58:03  1.576119   0.0   8.355      0.0    0.0        1.0    0.0
2014-02-01 09:58:33  1.576119   0.0  13.371      0.0    0.0        1.0    0.0
2014-02-01 09:59:03  1.576119   0.0  13.833      0.0    0.0        1.0    0.0
2014-02-01 09:59:33       NaN   NaN     NaN      NaN    NaN        NaN    NaN
2014-02-01 10:00:03       NaN   NaN     NaN      NaN    NaN        NaN    NaN

However, when I apply

df = pd.concat([df, pd.DataFrame(np.array([np.nan, np.nan]))])

I'm instead left with

2014-02-01 09:58:03  1.576119   0.0   8.355      0.0    0.0        1.0    0.0
2014-02-01 09:58:33  1.576119   0.0  13.371      0.0    0.0        1.0    0.0
2014-02-01 09:59:03  1.576119   0.0  13.833      0.0    0.0        1.0    0.0
0                         NaN   NaN     NaN      NaN    NaN        NaN    NaN
1                         NaN   NaN     NaN      NaN    NaN        NaN    NaN

My question is: how can I get the timestamp index pattern to continue? Is there something I should specify when creating the DataFrame to be concatenated, or can I re-index the result shown above?


For a more complete problem statement: I'm working with several time-series DataFrames, each of which is hundreds of thousands of rows, with the same initial time, varying ending times, and some missing values in each. I'm trying to get them to match lengths so I can interpolate NaN values with an np.nanmean() over all DataFrames at that element's index, which I'm doing by stacking the associated NumPy arrays for each DataFrame. Applying this averaging procedure across the arrays requires them to have the same dimensions, hence I am padding them with NaNs before interpolating.
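For context, the averaging step described above can be sketched as follows. This is a minimal toy example with made-up 30-second series of different lengths; the column name and values are illustrative, not from the question's data:

```python
import numpy as np
import pandas as pd

# Two toy series sharing a start time but with different lengths.
idx1 = pd.date_range('2014-02-01 09:58:03', periods=5, freq='30s')
idx2 = pd.date_range('2014-02-01 09:58:03', periods=3, freq='30s')
df1 = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx1)
df2 = pd.DataFrame({'a': [3.0, np.nan, 5.0]}, index=idx2)

# Pad the shorter frame with NaN rows so both span the same timestamps.
df2 = df2.reindex(idx1)

# Stack the aligned arrays and average elementwise, ignoring NaNs.
stacked = np.stack([df1.to_numpy(), df2.to_numpy()])
mean = np.nanmean(stacked, axis=0)
```

Note that `reindex` with the longer frame's index already produces the NaN padding in one step, which sidesteps the concatenation problem entirely when the target index is known.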

Upvotes: 0

Views: 428

Answers (3)

sitting_duck

Reputation: 3730

Another idea. (It doesn't directly answer your question, but it might help.) This is useful in situations where not all of your missing data is at the end of the frame.

Create a DF with date range in index:

df_nan = pd.DataFrame(
    index=pd.date_range('2014-02-01 09:58:03',periods=5,freq='30S')
)

Outer join with your smaller DF:

df.join(df_nan, how='outer')

2014-02-01 09:58:03  1.576119  0.0   8.355  0.0  0.0  1.0  0.0
2014-02-01 09:58:33  1.576119  0.0  13.371  0.0  0.0  1.0  0.0
2014-02-01 09:59:03  1.576119  0.0  13.833  0.0  0.0  1.0  0.0
2014-02-01 09:59:33       NaN  NaN     NaN  NaN  NaN  NaN  NaN
2014-02-01 10:00:03       NaN  NaN     NaN  NaN  NaN  NaN  NaN
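Put together, a runnable sketch of this approach (rebuilding a small frame similar to the snapshot in the question; the column names and the five-period range are placeholders):

```python
import pandas as pd

# Recreate a three-row snapshot with a 30-second Timestamp index.
idx = pd.date_range('2014-02-01 09:58:03', periods=3, freq='30s')
df = pd.DataFrame({'c0': [1.576119] * 3,
                   'c1': [0.0] * 3,
                   'c2': [8.355, 13.371, 13.833]}, index=idx)

# A frame whose only content is the full 30-second index.
df_nan = pd.DataFrame(
    index=pd.date_range('2014-02-01 09:58:03', periods=5, freq='30s')
)

# The outer join keeps every timestamp and fills missing rows with NaN.
out = df.join(df_nan, how='outer')
```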

Upvotes: 0

sitting_duck

Reputation: 3730

Suppose you have two DataFrames, df_big and df_small, and the row indices of df_small match the beginning row indices of df_big. Then you could:

Add the NaN rows as you describe above so that the number of rows in df_small matches the number of rows in df_big.

Then copy the index from df_big to df_small.

df_small.index = df_big.index

A different idea:

You could use the time delta between the last two rows to generate new index entries.

Set number of entries to add.

rows_to_add = 2

Create a new and extended index based on your original DF - before you add the NaN rows:

ext_index = list(df.index) + \
    [df.index[-1] + (df.index[-1] - df.index[-2]) * x for x in range(1,rows_to_add+1)]

[Timestamp('2014-02-01 09:58:03'),
 Timestamp('2014-02-01 09:58:33'),
 Timestamp('2014-02-01 09:59:03'),
 Timestamp('2014-02-01 09:59:33'),
 Timestamp('2014-02-01 10:00:03')]

Then add your NaN rows as in your question. (The same number of rows as the constant rows_to_add).

Then set your new index:

df.index = ext_index
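A runnable sketch of the whole sequence, using a toy frame (the column name and values are placeholders):

```python
import numpy as np
import pandas as pd

# Toy frame with a 30-second Timestamp index.
df = pd.DataFrame(
    {'val': [8.355, 13.371, 13.833]},
    index=pd.date_range('2014-02-01 09:58:03', periods=3, freq='30s'),
)

rows_to_add = 2

# Extend the index using the delta between the last two rows.
step = df.index[-1] - df.index[-2]
ext_index = list(df.index) + [df.index[-1] + step * x
                              for x in range(1, rows_to_add + 1)]

# Append the NaN rows, then assign the extended index.
nan_rows = pd.DataFrame(np.full((rows_to_add, df.shape[1]), np.nan),
                        columns=df.columns)
df = pd.concat([df, nan_rows])
df.index = ext_index
```

Because the step is derived from the data itself, this works for any regular spacing, not just 30 seconds.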

Upvotes: 1

Anurag Dabas

Reputation: 24322

After you are done concatenating:

Make use of pd.date_range() and the index attribute:

df.index=pd.date_range(df.index[0],periods=len(df),freq='30S')

Output of df:

2014-02-01 09:58:03  1.576119   0.0   8.355      0.0    0.0        1.0    0.0
2014-02-01 09:58:33  1.576119   0.0  13.371      0.0    0.0        1.0    0.0
2014-02-01 09:59:03  1.576119   0.0  13.833      0.0    0.0        1.0    0.0
2014-02-01 09:59:33       NaN   NaN     NaN      NaN    NaN        NaN    NaN
2014-02-01 10:00:03       NaN   NaN     NaN      NaN    NaN        NaN    NaN
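As a self-contained sketch (toy frame; the column name is a placeholder), this reproduces the question's concat step and then rebuilds the index from the first timestamp:

```python
import numpy as np
import pandas as pd

# Toy frame, then two appended NaN rows (integer-indexed, as in the question).
df = pd.DataFrame(
    {'val': [8.355, 13.371, 13.833]},
    index=pd.date_range('2014-02-01 09:58:03', periods=3, freq='30s'),
)
df = pd.concat([df, pd.DataFrame({'val': [np.nan, np.nan]})])

# Rebuild the whole index from the first timestamp at a 30-second frequency.
df.index = pd.date_range(df.index[0], periods=len(df), freq='30s')
```

This relies on the original index being perfectly regular; any gap in the existing timestamps would be silently papered over.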

Upvotes: 1
