herrherr
herrherr

Reputation: 698

Fill missing days in timeseries (with duplicate keys)

Here is what I'm trying to do in Pandas:

Currently it looks like this:

import pandas as pd
import dateutil

df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)

Unfortunately this throws the following error which I don't quite understand:

ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)

I've tried a dozen variations of this - without any luck. Any help would be much appreciated.

Edit:

After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.

The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?

Upvotes: 1

Views: 1668

Answers (2)

user2616532
user2616532

Reputation: 31

I'm having a similar problem at the moment, I think you shouldn't use reindex but something like asfreq or resample.

with them you don't need to create an index, thy will.

Upvotes: 0

Jeff
Jeff

Reputation: 128918

So you have duplicates when considering symbol,date,action.

In [99]: df.head(10)
Out[99]: 
  symbol                date      change    action
0    FDC 2001-08-15 00:00:00   15.069360       new
1    GPS 2001-08-15 00:00:00   19.653780       new
2    HON 2001-08-15 00:00:00    8.604316       new
3    LIZ 2001-08-15 00:00:00    6.711568       new
4    NKE 2001-08-15 00:00:00   22.686257       new
5    ODP 2001-08-15 00:00:00    5.686902       new
6    OSI 2001-08-15 00:00:00    5.893340       new
7    USB 2001-08-15 00:00:00   15.694478       new
8    NEE 2001-11-15 00:00:00  100.000000       new
9    GPS 2001-11-15 00:00:00  142.522231  increase

Create the new date index

In [102]: idx = pd.date_range(df.date.min(),df.date.max())

In [103]: idx
Out[103]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None

This will, group by symbol and action Then reindex that set to the full dates (idx) Select out the only remaining column (change) As now the index is symbol/date

In [100]: df.groupby(['symbol','action']).apply(
              lambda x: x.set_index('date').reindex(idx)
                  )['change'].reset_index(level=1).head()

Out[100]: 
                     action  change
symbol                             
ADM    2001-08-15  decrease     NaN
       2001-08-16  decrease     NaN
       2001-08-17  decrease     NaN
       2001-08-18  decrease     NaN
       2001-08-19  decrease     NaN

In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action    977632  non-null values
change    490  non-null values
dtypes: float64(1), object(1)

You can then fill forward or whatever you need. FYI, not sure what you are going to do with this, but this is not a very common type of operation as you have mostly empty data.

Upvotes: 2

Related Questions