Nikhil Mulley
Nikhil Mulley

Reputation: 736

ValueError: cannot join with no overlapping index names on the same dataframe

I am encountering a strange problem with a pandas dataframe where in, where() fails complaining that it cannot join on the overlapping index names.

To reproduce this problem try below:

import yfinance as yf
from datetime import datetime
startdate=datetime(2022,12,1)
enddate=datetime(2022,12,6)
y_symbols = ['GOOG', 'AAPL', 'MSFT']
data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
data[data['Close'] > 100]

Then the raised error looks like:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
..
  File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 229, in join
    join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
  File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4658, in join
    return self._join_multi(other, how=how)
  File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4782, in _join_multi
    raise ValueError("cannot join with no overlapping index names")
ValueError: cannot join with no overlapping index names

Here, data looks like:

                 Close                                High                          ...        Open                            Volume
                  AAPL        GOOG        MSFT        AAPL        GOOG        MSFT  ...        AAPL        GOOG        MSFT      AAPL      GOOG      MSFT
Date                                                                                ...
2022-12-01  148.309998  101.279999  254.690002  149.130005  102.589996  256.119995  ...  148.210007  101.400002  253.869995  71250400  21771500  26041500
2022-12-02  147.809998  100.830002  255.020004  148.000000  101.150002  256.059998  ...  145.960007   99.370003  249.820007  65421400  18812200  21522800
2022-12-05  146.630005   99.870003  250.199997  150.919998  101.750000  253.820007  ...  147.770004   99.815002  252.009995  68826400  19955500  23435300

What could be missing here in the dataframe that this would not work?

Upvotes: 1

Views: 11750

Answers (4)

cottontail
cottontail

Reputation: 23281

Boolean indexing, where(), mask() etc. aligns both columns and indices using align(), which needs level to align on if either column or indices are MultiIndex. The most basic way to reproduce the error is the following where an align() call is used to align columns and indices of two dataframes. By default, it tries on all levels but because there is a mismatch in the number of levels, we get the error in the title. The solution is to pass the level=.

df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df2 = pd.DataFrame({'a': [1, 2]})
df1.align(df2)             # <--- ValueError: cannot join with no overlapping index names
df1.align(df2, level=0)    # <--- OK

However, with boolean indexing, we can't set the level as in align(), so we must set level names to the MultiIndex (as OP observed) to solve the issue.

df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df1.columns.names = ['f', 's']
df1[df1['a']>0]            # <--- OK

Setting level names to the columns lets pandas align the columns accordingly.


Another way this error shows up is if we use mask or where to replace values but the original and replacement dataframes have different number of levels. In that case, the solution is to pass the level to align on:

df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df1.where(df1>1, pd.DataFrame({'b': [10,20]}))            # <--- ValueError:
df1.where(df1>1, pd.DataFrame({'b': [10,20]}), level=1)   # <--- OK

In the above example, since the aligning column is column b in level=1, we must pass it to make it work correctly.

Upvotes: 0

Nikhil Mulley
Nikhil Mulley

Reputation: 736

Thanks to @JayPeerachai leading answer for a direction. As to why this was working before: please check the solution section in the question at here.

I was using pandas datareader to fetch data from yahoo finance until it broke yesterday and was not able to get data, this prompted me to switch to yfinance but as it turns out getting the results was okay but the data itself was somehow not allowing the where function queries despite being the same dataset. Hence, this question.

Further probing into pandas datareader and yfinance code helped me to understand that the column names were different, and pandas datareader actually sets the column names before returning the dataset to user code.

All it needed was, setting the column names after fetching the data and was able to run the where function queries iteratively too.

So the solution is to set the column names after fetching the result from yfinance.

Ideally, would like yfinance itself to take care of this.

>>> import yfinance as yf
>>> from datetime import datetime
>>> startdate=datetime(2022,12,1)
>>> enddate=datetime(2022,12,6)
>>> y_symbols = ['GOOG', 'AAPL', 'MSFT']
>>> data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
[*********************100%***********************]  3 of 3 completed
>>> data.columns.names = ["Attributes", "Symbols"]
>>> data[data['Close'] > 100]
Attributes       Close                                High                                 Low                                Open                            Volume
Symbols           AAPL        GOOG        MSFT        AAPL        GOOG        MSFT        AAPL        GOOG        MSFT        AAPL        GOOG        MSFT      AAPL        GOOG      MSFT
Date
2022-12-01  148.309998  101.279999  254.690002  149.130005  102.589996  256.119995  146.610001  100.669998  250.919998  148.210007  101.400002  253.869995  71250400  21771500.0  26041500
2022-12-02  147.809998  100.830002  255.020004  148.000000  101.150002  256.059998  145.649994   99.169998  249.690002  145.960007   99.370003  249.820007  65421400  18812200.0  21522800
2022-12-05  146.630005         NaN  250.199997  150.919998         NaN  253.820007  145.770004         NaN  248.059998  147.770004         NaN  252.009995  68826400         NaN  23435300
>>>

>>> data.where(data['Close'] > 100).where(data['High'] > 120)
Attributes       Close                         High                          Low                         Open                     Volume
Symbols           AAPL GOOG        MSFT        AAPL GOOG        MSFT        AAPL GOOG        MSFT        AAPL GOOG        MSFT      AAPL GOOG      MSFT
Date
2022-12-01  148.309998  NaN  254.690002  149.130005  NaN  256.119995  146.610001  NaN  250.919998  148.210007  NaN  253.869995  71250400  NaN  26041500
2022-12-02  147.809998  NaN  255.020004  148.000000  NaN  256.059998  145.649994  NaN  249.690002  145.960007  NaN  249.820007  65421400  NaN  21522800
2022-12-05  146.630005  NaN  250.199997  150.919998  NaN  253.820007  145.770004  NaN  248.059998  147.770004  NaN  252.009995  68826400  NaN  23435300

Upvotes: 1

JayPeerachai
JayPeerachai

Reputation: 3842

Maybe it causes by the multi-level column as the where() method expects a single-level column. Try to flatten it first.

startdate=datetime(2022,12,1)
enddate=datetime(2022,12,6)
y_symbols = ['GOOG', 'AAPL', 'MSFT']
data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
data = data.stack()
filtered_cond = data['Close'] > 100
filtered_data = data.where(filtered_cond).unstack()

Upvotes: 1

Alireza75
Alireza75

Reputation: 501

Where method accept a list like argument with boolean values for filtering. You must pass to it a pandas series, numpy array, python list, etc. But you pass to it a dataframe (df['close']>500) and where method raise error You can read pandas docs for more information about it

Upvotes: 0

Related Questions