Fill missing days in timeseries (with duplicate keys)

Question

Here is what I'm trying to do in Pandas:

load CSV file containing information about stocks for certain days
find the earliest and latest dates in the column date
create a new dataframe where all the days between the earliest and latest are filled (NaN or something like "missing" for all columns would be fine)

Currently it looks like this:

import pandas as pd
import dateutil

df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)

Unfortunately this throws the following error which I don't quite understand:

ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)

I've tried a dozen variations of this - without any luck. Any help would be much appreciated.

Edit:

After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.

The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?

Jeff · Accepted Answer

So you have duplicates when considering symbol,date,action.

In [99]: df.head(10)
Out[99]: 
  symbol                date      change    action
0    FDC 2001-08-15 00:00:00   15.069360       new
1    GPS 2001-08-15 00:00:00   19.653780       new
2    HON 2001-08-15 00:00:00    8.604316       new
3    LIZ 2001-08-15 00:00:00    6.711568       new
4    NKE 2001-08-15 00:00:00   22.686257       new
5    ODP 2001-08-15 00:00:00    5.686902       new
6    OSI 2001-08-15 00:00:00    5.893340       new
7    USB 2001-08-15 00:00:00   15.694478       new
8    NEE 2001-11-15 00:00:00  100.000000       new
9    GPS 2001-11-15 00:00:00  142.522231  increase

Create the new date index

In [102]: idx = pd.date_range(df.date.min(),df.date.max())

In [103]: idx
Out[103]: 

[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None

This will, group by symbol and action Then reindex that set to the full dates (idx) Select out the only remaining column (change) As now the index is symbol/date

In [100]: df.groupby(['symbol','action']).apply(
              lambda x: x.set_index('date').reindex(idx)
                  )['change'].reset_index(level=1).head()

Out[100]: 
                     action  change
symbol                             
ADM    2001-08-15  decrease     NaN
       2001-08-16  decrease     NaN
       2001-08-17  decrease     NaN
       2001-08-18  decrease     NaN
       2001-08-19  decrease     NaN

In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]: 

MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action    977632  non-null values
change    490  non-null values
dtypes: float64(1), object(1)

You can then fill forward or whatever you need. FYI, not sure what you are going to do with this, but this is not a very common type of operation as you have mostly empty data.

Fill missing days in timeseries (with duplicate keys)

Answers (2)

Related Questions