Reputation: 3058
I have a simple pandas
series:
import pandas as pd
quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)
which looks as follows:
2015-04-01 1
2015-04-01 14
2015-04-02 14
2015-04-03 11
2015-04-04 12
2015-04-05 13
2015-04-08 14
dtype: int64
I would like to fill the missing dates i.e. 2015-04-06 = NaN
and 2015-04-07 = NaN
but keep the series as is, i.e.:
2015-04-01 1
2015-04-01 14
2015-04-02 14
2015-04-03 11
2015-04-04 12
2015-04-05 13
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14
dtype: int64
I've tried:
series = series.asfreq('D')
but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.
Is there any way on Earth to achieve this?
Thanks for any help.
Upvotes: 6
Views: 1412
Reputation: 153460
Let's try:
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
Output:
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
dtype: float64
Timings:
%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()
2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use
pandas.date_range
instead.2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()
1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks @root for this suggestion.
%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
.difference(series.index))
pd.concat([series,s]).sort_index()
1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 7
Reputation: 2811
You can use asfreq
removing the index duplicates and then concat where temp.index
is not in original Serie
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()
output:
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
dtype: float64
Upvotes: 0
Reputation: 388
You can use pandas.concat
. Adding to your example code:
series2 = pd.Series([pd.np.nan, pd.np.nan],
index=[pd.Timestamp(2015, 4, 6),
pd.Timestamp(2015, 4, 7)])
pd.concat([series, series2], axis=0).sort_index()
returns
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
dtype: float64
That said, you are leaving yourself open to further difficulties by using a non-unique index. You would benefit from having a unique index level or non-index field to use for disambiguation.
Upvotes: 0
Reputation: 375
This should be enough, assuming you don't have millions of rows:
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)
Output:
x
2015-04-01 1.0
2015-04-01 14.0
2015-04-02 14.0
2015-04-03 11.0
2015-04-04 12.0
2015-04-05 13.0
2015-04-06 NaN
2015-04-07 NaN
2015-04-08 14.0
If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x
Upvotes: 1