Reputation: 736
I am encountering a strange problem with a pandas dataframe where in, where() fails complaining that it cannot join on the overlapping index names.
To reproduce this problem try below:
import yfinance as yf
from datetime import datetime
startdate=datetime(2022,12,1)
enddate=datetime(2022,12,6)
y_symbols = ['GOOG', 'AAPL', 'MSFT']
data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
data[data['Close'] > 100]
Then the raised error looks like:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
..
File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 229, in join
join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4658, in join
return self._join_multi(other, how=how)
File "lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4782, in _join_multi
raise ValueError("cannot join with no overlapping index names")
ValueError: cannot join with no overlapping index names
Here, data
looks like:
Close High ... Open Volume
AAPL GOOG MSFT AAPL GOOG MSFT ... AAPL GOOG MSFT AAPL GOOG MSFT
Date ...
2022-12-01 148.309998 101.279999 254.690002 149.130005 102.589996 256.119995 ... 148.210007 101.400002 253.869995 71250400 21771500 26041500
2022-12-02 147.809998 100.830002 255.020004 148.000000 101.150002 256.059998 ... 145.960007 99.370003 249.820007 65421400 18812200 21522800
2022-12-05 146.630005 99.870003 250.199997 150.919998 101.750000 253.820007 ... 147.770004 99.815002 252.009995 68826400 19955500 23435300
What could be missing here in the dataframe that this would not work?
Upvotes: 1
Views: 11750
Reputation: 23281
Boolean indexing, where()
, mask()
etc. aligns both columns and indices using align()
, which needs level to align on if either column or indices are MultiIndex. The most basic way to reproduce the error is the following where an align()
call is used to align columns and indices of two dataframes. By default, it tries on all levels but because there is a mismatch in the number of levels, we get the error in the title. The solution is to pass the level=
.
df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df2 = pd.DataFrame({'a': [1, 2]})
df1.align(df2) # <--- ValueError: cannot join with no overlapping index names
df1.align(df2, level=0) # <--- OK
However, with boolean indexing, we can't set the level as in align()
, so we must set level names to the MultiIndex (as OP observed) to solve the issue.
df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df1.columns.names = ['f', 's']
df1[df1['a']>0] # <--- OK
Setting level names to the columns lets pandas align the columns accordingly.
Another way this error shows up is if we use mask
or where
to replace values but the original and replacement dataframes have different number of levels. In that case, the solution is to pass the level to align on:
df1 = pd.DataFrame({('a', 'b'): [1, 2]})
df1.where(df1>1, pd.DataFrame({'b': [10,20]})) # <--- ValueError:
df1.where(df1>1, pd.DataFrame({'b': [10,20]}), level=1) # <--- OK
In the above example, since the aligning column is column b
in level=1, we must pass it to make it work correctly.
Upvotes: 0
Reputation: 736
Thanks to @JayPeerachai leading answer for a direction. As to why this was working before: please check the solution section in the question at here.
I was using pandas datareader to fetch data from yahoo finance until it broke yesterday and was not able to get data, this prompted me to switch to yfinance but as it turns out getting the results was okay but the data itself was somehow not allowing the where
function queries despite being the same dataset. Hence, this question.
Further probing into pandas datareader and yfinance code helped me to understand that the column names were different, and pandas datareader actually sets the column names before returning the dataset to user code.
All it needed was, setting the column names after fetching the data and was able to run the where
function queries iteratively too.
So the solution is to set the column names after fetching the result from yfinance
.
Ideally, would like yfinance
itself to take care of this.
>>> import yfinance as yf
>>> from datetime import datetime
>>> startdate=datetime(2022,12,1)
>>> enddate=datetime(2022,12,6)
>>> y_symbols = ['GOOG', 'AAPL', 'MSFT']
>>> data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
[*********************100%***********************] 3 of 3 completed
>>> data.columns.names = ["Attributes", "Symbols"]
>>> data[data['Close'] > 100]
Attributes Close High Low Open Volume
Symbols AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT
Date
2022-12-01 148.309998 101.279999 254.690002 149.130005 102.589996 256.119995 146.610001 100.669998 250.919998 148.210007 101.400002 253.869995 71250400 21771500.0 26041500
2022-12-02 147.809998 100.830002 255.020004 148.000000 101.150002 256.059998 145.649994 99.169998 249.690002 145.960007 99.370003 249.820007 65421400 18812200.0 21522800
2022-12-05 146.630005 NaN 250.199997 150.919998 NaN 253.820007 145.770004 NaN 248.059998 147.770004 NaN 252.009995 68826400 NaN 23435300
>>>
>>> data.where(data['Close'] > 100).where(data['High'] > 120)
Attributes Close High Low Open Volume
Symbols AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT AAPL GOOG MSFT
Date
2022-12-01 148.309998 NaN 254.690002 149.130005 NaN 256.119995 146.610001 NaN 250.919998 148.210007 NaN 253.869995 71250400 NaN 26041500
2022-12-02 147.809998 NaN 255.020004 148.000000 NaN 256.059998 145.649994 NaN 249.690002 145.960007 NaN 249.820007 65421400 NaN 21522800
2022-12-05 146.630005 NaN 250.199997 150.919998 NaN 253.820007 145.770004 NaN 248.059998 147.770004 NaN 252.009995 68826400 NaN 23435300
Upvotes: 1
Reputation: 3842
Maybe it causes by the multi-level column as the where()
method expects a single-level column. Try to flatten it first.
startdate=datetime(2022,12,1)
enddate=datetime(2022,12,6)
y_symbols = ['GOOG', 'AAPL', 'MSFT']
data=yf.download(y_symbols, start=startdate, end=enddate, auto_adjust=True, threads=True)
data = data.stack()
filtered_cond = data['Close'] > 100
filtered_data = data.where(filtered_cond).unstack()
Upvotes: 1
Reputation: 501
Where method accept a list like argument with boolean values for filtering. You must pass to it a pandas series, numpy array, python list, etc. But you pass to it a dataframe (df['close']>500) and where method raise error You can read pandas docs for more information about it
Upvotes: 0