Reputation: 3058

pandas: Fill missing dates when keeping duplicates

I have a simple pandas series:

import pandas as pd

quantities = [1, 14, 14, 11, 12, 13, 14]
timestamps = [pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 1), pd.Timestamp(2015, 4, 2), pd.Timestamp(2015, 4, 3), pd.Timestamp(2015, 4, 4), pd.Timestamp(2015, 4, 5), pd.Timestamp(2015, 4, 8)]
series = pd.Series(quantities, index=timestamps)

which looks as follows:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-08    14
dtype: int64

I would like to fill the missing dates i.e. 2015-04-06 = NaN and 2015-04-07 = NaN but keep the series as is, i.e.:

2015-04-01     1
2015-04-01    14
2015-04-02    14
2015-04-03    11
2015-04-04    12
2015-04-05    13
2015-04-06    NaN
2015-04-07    NaN
2015-04-08    14
dtype: int64

I've tried:

series = series.asfreq('D')

but get the following error: ValueError: cannot reindex from a duplicate axis. This error happens because of the duplicate timestamp values.

Is there any way on Earth to achieve this?

Thanks for any help.

Upvotes: 6

Answers (4)

Scott Boston

Reputation: 153460

Let's try:

s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

Output:

2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64

Timings:

%%timeit
temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()

2.51 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated. Use pandas.date_range instead.

2.07 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
s = pd.Series(np.nan, index=pd.date_range(series.index.min(), series.index.max(), freq='D'))
pd.concat([series,s[~s.index.isin(series.index)]]).sort_index()

1.86 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Thanks @root for this suggestion.

%%timeit
s = pd.Series(index=pd.date_range(series.index.min(), series.index.max(), freq='D')\
                      .difference(series.index))
pd.concat([series,s]).sort_index()

1.55 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 7

Terry

Reputation: 2811

You can use asfreq removing the index duplicates and then concat where temp.index is not in original Serie

temp = series[~series.index.duplicated(keep='first')].asfreq('D')
pd.concat([series, temp.loc[~temp.index.isin(series.index)]]).sort_index()

output:
2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64

Upvotes: 0

ralex

Reputation: 388

You can use pandas.concat. Adding to your example code:

series2 = pd.Series([pd.np.nan, pd.np.nan],
                    index=[pd.Timestamp(2015, 4, 6), 
                           pd.Timestamp(2015, 4, 7)])

pd.concat([series, series2], axis=0).sort_index()

returns

2015-04-01     1.0
2015-04-01    14.0
2015-04-02    14.0
2015-04-03    11.0
2015-04-04    12.0
2015-04-05    13.0
2015-04-06     NaN
2015-04-07     NaN
2015-04-08    14.0
dtype: float64

That said, you are leaving yourself open to further difficulties by using a non-unique index. You would benefit from having a unique index level or non-index field to use for disambiguation.

Upvotes: 0

arinarmo

Reputation: 375

This should be enough, assuming you don't have millions of rows:

series.name = "x"
calendar = pd.DataFrame(None, index=pd.DatetimeIndex(start=series.index.min(), end=series.index.max(), freq='D'))
calendar.join(series)

Output:

               x
2015-04-01   1.0
2015-04-01  14.0
2015-04-02  14.0
2015-04-03  11.0
2015-04-04  12.0
2015-04-05  13.0
2015-04-06   NaN
2015-04-07   NaN
2015-04-08  14.0

If you want a series you can access the column x of the resulting DataFrame: calendar.join(series).x

Upvotes: 1

pandas: Fill missing dates when keeping duplicates

Answers (4)

Related Questions