Slartibartfast
Slartibartfast

Reputation: 1190

reindex from a duplicate axis

I have the following code:

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime

TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()

start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

The last line data = data.reindex(full_dates) gives me the following error:

ValueError: cannot reindex from a duplicate axis

What i am trying to do is fill the missing dates and reindex the column.


As mentioned by @jezrael "problem is duplicated values in DatetimeIndex, so reindex cannot be used here"

I have used the same code earlier and it worked fine. Curious why it is not working in this case

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import numpy as np

stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')

full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

The code is the same except the provider but this one works and the one above did not?

Upvotes: 0

Views: 290

Answers (1)

jezrael
jezrael

Reputation: 862591

So problem is duplicated values in DatetimeIndex, so reindex cannot be used here.

Possible solution is use DataFrame.join with helper DataFrame by all values:

data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
           date  bidopen  bidclose  bidhigh  bidlow  askopen  askclose  \
0    2015-01-01      NaN       NaN      NaN     NaN      NaN       NaN   
1    2015-01-02   2.9350     2.947   3.0910   2.860   2.9450     2.957   
2    2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN   
3    2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN   
4    2015-01-05   2.9470     2.912   3.1710   2.871   2.9570     2.922   
        ...      ...       ...      ...     ...      ...       ...   
1797 2019-12-03   2.3890     2.441   2.5115   2.371   2.3970     2.449   
1798 2019-12-04   2.3455     2.392   2.3970   2.341   2.3535     2.400   
1798 2019-12-04   2.4410     2.406   2.4645   2.370   2.4490     2.414   
1799 2019-12-05   2.4060     2.421   2.4650   2.399   2.4140     2.429   
1800 2019-12-06      NaN       NaN      NaN     NaN      NaN       NaN   

      askhigh  asklow  tickqty  
0         NaN     NaN      NaN  
1       3.101  2.8700  12688.0  
2         NaN     NaN      NaN  
3         NaN     NaN      NaN  
4       3.181  2.8810  21849.0  
      ...     ...      ...  
1797    2.519  2.3785  36679.0  
1798    2.406  2.3505   5333.0  
1798    2.473  2.3780  74881.0  
1799    2.473  2.4070  29238.0  
1800      NaN     NaN      NaN  

[1802 rows x 10 columns]

But I think next processing should be problematic (because duplicated index), so use DataFrame.resample by days with aggregation functions in dictionary:

df = data.resample('D').agg({'bidopen': 'first', 
                             'bidclose': 'last',
                             'bidhigh': 'max', 
                             'bidlow': 'min', 
                             'askopen': 'first', 
                             'askclose': 'last',
                             'askhigh': 'max', 
                             'asklow': 'min', 
                             'tickqty':'sum'})

print (df)
            bidopen  bidclose  bidhigh  bidlow  askopen  askclose  askhigh  \
date                                                                         
2015-01-02   2.9350    2.9470   3.0910   2.860   2.9450    2.9570    3.101   
2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-05   2.9470    2.9120   3.1710   2.871   2.9570    2.9220    3.181   
2015-01-06   2.9120    2.9400   2.9510   2.807   2.9220    2.9500    2.961   
            ...       ...      ...     ...      ...       ...      ...   
2019-12-01      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2019-12-02   2.3505    2.3455   2.3670   2.292   2.3590    2.3535    2.375   
2019-12-03   2.3890    2.4410   2.5115   2.371   2.3970    2.4490    2.519   
2019-12-04   2.3455    2.4060   2.4645   2.341   2.3535    2.4140    2.473   
2019-12-05   2.4060    2.4210   2.4650   2.399   2.4140    2.4290    2.473   

            asklow  tickqty  
date                         
2015-01-02  2.8700    12688  
2015-01-03     NaN        0  
2015-01-04     NaN        0  
2015-01-05  2.8810    21849  
2015-01-06  2.8170    17955  
           ...      ...  
2019-12-01     NaN        0  
2019-12-02  2.3000    31173  
2019-12-03  2.3785    36679  
2019-12-04  2.3505    80214  
2019-12-05  2.4070    29238  

[1799 rows x 9 columns]

Upvotes: 1

Related Questions