Reputation: 143
Here is a snapshot of a dataframe df I'm working with:
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
with Timestamp indices spaced by 30 seconds. I'm trying to concatenate a number of rows populated with np.nan values while keeping the pattern of 30-second-spaced Timestamp indices, i.e. something that would look like:
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
However, when I apply
df = pd.concat([df, pd.DataFrame(np.array([np.nan, np.nan]))])
I'm instead left with
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
My question is: how can I get the timestamp index pattern to continue? Is there something I should specify when creating the dataframe to be concatenated, or can I re-index the dataframe shown above?
For a more complete problem statement: I'm working with several time-series dataframes, each hundreds of thousands of rows long, all with the same initial time but varying ending times, and each with some missing values. I'm trying to make them the same length so I can fill in NaN values with an np.nanmean() over the values at that element's index across all dataframes. I do this by stacking the associated numpy arrays for each dataframe; applying the averaging procedure across the arrays requires them to have the same dimensions, hence padding them with NaNs before interpolating.
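For context, the averaging step looks roughly like this (a sketch with toy arrays standing in for the padded per-dataframe arrays):
import numpy as np

# Two toy "dataframes" as arrays, already padded with NaNs to the same shape.
a = np.array([[1.0, 2.0], [np.nan, 4.0]])
b = np.array([[3.0, np.nan], [5.0, 6.0]])

stacked = np.stack([a, b])                 # shape (2, 2, 2)
mean_values = np.nanmean(stacked, axis=0)  # element-wise mean, ignoring NaNs
# array([[2., 2.], [5., 5.]])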
Upvotes: 0
Views: 428
Reputation: 3730
Another idea (it doesn't directly answer your question, but it might help): this would be useful in situations where not all of your missing data is at the end of the frame.
Create a DF with date range in index:
df_nan = pd.DataFrame(
    index=pd.date_range('2014-02-01 09:58:03', periods=5, freq='30S')
)
Outer join with your smaller DF:
df.join(df_nan, how='outer')
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
Upvotes: 0
Reputation: 3730
If it is the case that you have, say, two DFs - df_big and df_small - and the row indices in df_small match the beginning row indices of df_big, you could:
Add the NaN rows as you describe above so that the number of rows in df_small matches the number of rows in df_big. Then copy the index from df_big to df_small:
df_small.index = df_big.index
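Put together, that might look like this (a sketch, assuming df_small only needs extra rows appended at the end):
import numpy as np
import pandas as pd

# Pad df_small with NaN rows until it is as long as df_big.
missing = len(df_big) - len(df_small)
filler = pd.DataFrame(np.nan, index=range(missing), columns=df_small.columns)
df_small = pd.concat([df_small, filler])

# Row counts now match, so df_big's index can be copied over.
df_small.index = df_big.index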
A different idea:
You could use the time delta between the last two rows to generate new index entries.
Set the number of entries to add:
rows_to_add = 2
Create a new and extended index based on your original DF - before you add the NaN rows:
ext_index = list(df.index) + \
    [df.index[-1] + (df.index[-1] - df.index[-2]) * x for x in range(1, rows_to_add + 1)]
[Timestamp('2014-02-01 09:58:03'),
Timestamp('2014-02-01 09:58:33'),
Timestamp('2014-02-01 09:59:03'),
Timestamp('2014-02-01 09:59:33'),
Timestamp('2014-02-01 10:00:03')]
Then add your NaN rows as in your question (the same number of rows as the constant rows_to_add).
Then set your new index:
df.index = ext_index
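End to end, the idea looks roughly like this (a sketch reusing rows_to_add from above):
import numpy as np
import pandas as pd

rows_to_add = 2

# Extend the index by repeating the spacing of the last two rows.
step = df.index[-1] - df.index[-2]
ext_index = list(df.index) + [df.index[-1] + step * x
                              for x in range(1, rows_to_add + 1)]

# Append the NaN rows, then apply the extended index.
filler = pd.DataFrame(np.nan, index=range(rows_to_add), columns=df.columns)
df = pd.concat([df, filler])
df.index = ext_index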
Upvotes: 1
Reputation: 24322
After you are done concatenating:
Make use of pd.date_range() and the index attribute:
df.index = pd.date_range(df.index[0], periods=len(df), freq='30S')
Output of df:
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
Upvotes: 1