Reputation: 1190
I have the following code:
import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()
start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
The last line data = data.reindex(full_dates)
gives me the following error:
ValueError: cannot reindex from a duplicate axis
What i am trying to do is fill the missing dates and reindex the column.
As mentioned by @jezrael "problem is duplicated values in DatetimeIndex, so reindex cannot be used here"
I have used the same code earlier and it worked fine. Curious why it is not working in this case
import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import numpy as np
stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
The code is the same except the provider but this one works and the one above did not?
Upvotes: 0
Views: 290
Reputation: 862591
So problem is duplicated values in DatetimeIndex
, so reindex
cannot be used here.
Possible solution is use DataFrame.join
with helper DataFrame
by all values:
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
date bidopen bidclose bidhigh bidlow askopen askclose \
0 2015-01-01 NaN NaN NaN NaN NaN NaN
1 2015-01-02 2.9350 2.947 3.0910 2.860 2.9450 2.957
2 2015-01-03 NaN NaN NaN NaN NaN NaN
3 2015-01-04 NaN NaN NaN NaN NaN NaN
4 2015-01-05 2.9470 2.912 3.1710 2.871 2.9570 2.922
... ... ... ... ... ... ...
1797 2019-12-03 2.3890 2.441 2.5115 2.371 2.3970 2.449
1798 2019-12-04 2.3455 2.392 2.3970 2.341 2.3535 2.400
1798 2019-12-04 2.4410 2.406 2.4645 2.370 2.4490 2.414
1799 2019-12-05 2.4060 2.421 2.4650 2.399 2.4140 2.429
1800 2019-12-06 NaN NaN NaN NaN NaN NaN
askhigh asklow tickqty
0 NaN NaN NaN
1 3.101 2.8700 12688.0
2 NaN NaN NaN
3 NaN NaN NaN
4 3.181 2.8810 21849.0
... ... ...
1797 2.519 2.3785 36679.0
1798 2.406 2.3505 5333.0
1798 2.473 2.3780 74881.0
1799 2.473 2.4070 29238.0
1800 NaN NaN NaN
[1802 rows x 10 columns]
But I think next processing should be problematic (because duplicated index), so use DataFrame.resample
by days with aggregation functions in dictionary:
df = data.resample('D').agg({'bidopen': 'first',
'bidclose': 'last',
'bidhigh': 'max',
'bidlow': 'min',
'askopen': 'first',
'askclose': 'last',
'askhigh': 'max',
'asklow': 'min',
'tickqty':'sum'})
print (df)
bidopen bidclose bidhigh bidlow askopen askclose askhigh \
date
2015-01-02 2.9350 2.9470 3.0910 2.860 2.9450 2.9570 3.101
2015-01-03 NaN NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN NaN
2015-01-05 2.9470 2.9120 3.1710 2.871 2.9570 2.9220 3.181
2015-01-06 2.9120 2.9400 2.9510 2.807 2.9220 2.9500 2.961
... ... ... ... ... ... ...
2019-12-01 NaN NaN NaN NaN NaN NaN NaN
2019-12-02 2.3505 2.3455 2.3670 2.292 2.3590 2.3535 2.375
2019-12-03 2.3890 2.4410 2.5115 2.371 2.3970 2.4490 2.519
2019-12-04 2.3455 2.4060 2.4645 2.341 2.3535 2.4140 2.473
2019-12-05 2.4060 2.4210 2.4650 2.399 2.4140 2.4290 2.473
asklow tickqty
date
2015-01-02 2.8700 12688
2015-01-03 NaN 0
2015-01-04 NaN 0
2015-01-05 2.8810 21849
2015-01-06 2.8170 17955
... ...
2019-12-01 NaN 0
2019-12-02 2.3000 31173
2019-12-03 2.3785 36679
2019-12-04 2.3505 80214
2019-12-05 2.4070 29238
[1799 rows x 9 columns]
Upvotes: 1