Reputation: 698
Here is what I'm trying to do in Pandas:
date
Currently it looks like this:
import pandas as pd
import dateutil
df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)
Unfortunately this throws the following error which I don't quite understand:
ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)
I've tried a dozen variations of this - without any luck. Any help would be much appreciated.
Edit:
After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.
The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?
Upvotes: 1
Views: 1668
Reputation: 31
I'm having a similar problem at the moment, I think you shouldn't use reindex but something like asfreq or resample.
with them you don't need to create an index, thy will.
Upvotes: 0
Reputation: 128918
So you have duplicates when considering symbol,date,action.
In [99]: df.head(10)
Out[99]:
symbol date change action
0 FDC 2001-08-15 00:00:00 15.069360 new
1 GPS 2001-08-15 00:00:00 19.653780 new
2 HON 2001-08-15 00:00:00 8.604316 new
3 LIZ 2001-08-15 00:00:00 6.711568 new
4 NKE 2001-08-15 00:00:00 22.686257 new
5 ODP 2001-08-15 00:00:00 5.686902 new
6 OSI 2001-08-15 00:00:00 5.893340 new
7 USB 2001-08-15 00:00:00 15.694478 new
8 NEE 2001-11-15 00:00:00 100.000000 new
9 GPS 2001-11-15 00:00:00 142.522231 increase
Create the new date index
In [102]: idx = pd.date_range(df.date.min(),df.date.max())
In [103]: idx
Out[103]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None
This will, group by symbol and action Then reindex that set to the full dates (idx) Select out the only remaining column (change) As now the index is symbol/date
In [100]: df.groupby(['symbol','action']).apply(
lambda x: x.set_index('date').reindex(idx)
)['change'].reset_index(level=1).head()
Out[100]:
action change
symbol
ADM 2001-08-15 decrease NaN
2001-08-16 decrease NaN
2001-08-17 decrease NaN
2001-08-18 decrease NaN
2001-08-19 decrease NaN
In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action 977632 non-null values
change 490 non-null values
dtypes: float64(1), object(1)
You can then fill forward or whatever you need. FYI, not sure what you are going to do with this, but this is not a very common type of operation as you have mostly empty data.
Upvotes: 2