Why does slicing on two similar DataFrames work differently?

Question

I understand that slicing a pandas DataFrame returns the selected rows as a DataFrame, and should return an empty DataFrame if no row is selected.

My question is about a discrepancy between two examples I am trying in pandas 1.2.0.

The first one results in an empty DataFrame (as I was expecting):

>> import numpy as np
>> import pandas as pd
>> df = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                     index=['r1', 'r2', 'r3', 'r4'], 
                     columns=['c1', 'c2', 'c3', 'c4'])
>> df['c2': 'c3']
Empty DataFrame
Columns: [c1, c2, c3, c4]
Index: []

But the second one (picked from "Python for Data Analysis, 2nd Edition") throws a KeyError!

>> data = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                       index=['Ohio', 'Colorado', 'Utah', 'New York'],
                       columns=['one', 'two', 'three', 'four'])
>> try:
...:     data['two': 'three']
...: except KeyError as k:
...:     print(f"KeyError: {k}")
...: 
KeyError: 'two'

My question is: why does the second DataFrame raise a KeyError? Why two different behaviors? Am I missing something, or is there a bug in version 1.2?

I verified my example multiple times; I hope there is no typo. Attaching

ThePyGuy · Accepted Answer

TL;DR:

pandas raises a KeyError for a row label slice if the row index is not sorted and at least one of the two slice labels is missing from it.

The pandas documentation on slicing with labels states explicitly when a KeyError is raised:

However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes)
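As a rough illustration of that rule, here is a small sketch. The Series and the labels in it are made up for the example and are not taken from the question; it tries the same kind of label slice on a sorted and on an unsorted index:

```python
import numpy as np
import pandas as pd

values = np.arange(4)

# Hypothetical example data: same labels, once sorted and once shuffled
sorted_s = pd.Series(values, index=['a', 'b', 'd', 'e'])
unsorted_s = pd.Series(values, index=['d', 'a', 'e', 'b'])

# Sorted index, 'c' is missing: no error, the slice stops where 'c' would be
in_sorted = sorted_s.loc['b':'c']
print(list(in_sorted.index))

# Unsorted index, both labels present: slices by position from 'a' to 'e'
in_unsorted = unsorted_s.loc['a':'e']
print(list(in_unsorted.index))

# Unsorted index, 'c' is missing: this is the case that raises KeyError
raised = False
try:
    unsorted_s.loc['a':'c']
except KeyError as err:
    raised = True
    print(f"KeyError: {err}")
```

So the KeyError only shows up when both conditions hold at once: the index is unsorted and a slice label is absent.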

When you do df['c2':'c3'] and data['two': 'three'], you observe two different behaviors because of the row index values.

In the first case (df), the row index is sorted, so no KeyError is raised. In the second case (data), the row index is not sorted, so pandas raises a KeyError. The error goes away if you try the same thing after sorting the index:

>>> data.sort_index()['one':'two']
Empty DataFrame
Columns: [one, two, three, four]
Index: []
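If you want to check up front which behavior to expect, one option is to look at the index's is_monotonic_increasing property. A minimal sketch reusing the two frames from the question (the variable names df_sorted, data_sorted and result are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                  index=['r1', 'r2', 'r3', 'r4'],
                  columns=['c1', 'c2', 'c3', 'c4'])
data = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# r1 < r2 < r3 < r4, so this index is sorted
df_sorted = bool(df.index.is_monotonic_increasing)

# 'Ohio' comes before 'Colorado' here, so this index is not sorted
data_sorted = bool(data.index.is_monotonic_increasing)

print(df_sorted, data_sorted)

# After sorting the index, the slice from the question no longer raises
result = data.sort_index()['two':'three']
print(result.empty)
```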

In both of the above examples, pandas looks for the slice labels in the row index, not the column index. That is why it returns an empty DataFrame even when the labels exist in the column index:

>>> df['c1':'c4']
Empty DataFrame
Columns: [c1, c2, c3, c4]
Index: []
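If the goal was actually to slice columns by label, the usual way is .loc with an explicit row selector. A short sketch on the df from the question, assuming that is what was intended:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                  index=['r1', 'r2', 'r3', 'r4'],
                  columns=['c1', 'c2', 'c3', 'c4'])

# A plain slice in [] is matched against the row index (both ends inclusive)
rows = df['r2':'r3']
print(rows)

# To slice columns by label, select all rows and slice the second axis
cols = df.loc[:, 'c2':'c3']
print(cols)
```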

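You might wonder why the example above returns an empty DataFrame instead of raising a KeyError, since neither c1 nor c4 exists in the row index. Because that index is sorted, pandas can still work out where the missing labels would fall, and the resulting range is simply empty. This mirrors native Python behavior, where slicing out of range does not raise an IndexError. Try: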

>>> [][100:200]
[]
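To see roughly how the labels get turned into positions, Index.slice_locs reports the integer bounds a label slice maps to. This is only a sketch of the mechanism as I understand it, reusing the frames from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                  index=['r1', 'r2', 'r3', 'r4'],
                  columns=['c1', 'c2', 'c3', 'c4'])
data = pd.DataFrame(np.arange(4 * 4).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Sorted index: a missing label is mapped to the position where it would be
# inserted. 'c1' and 'c4' both sort before 'r1', so start and stop coincide.
start, stop = df.index.slice_locs('c1', 'c4')
print(start, stop)
print(df.iloc[start:stop].empty)  # a positional slice with start == stop is empty

# Unsorted index: there is no meaningful insertion position, so pandas gives up
slice_error = None
try:
    data.index.slice_locs('two', 'three')
except KeyError as err:
    slice_error = err
    print(f"KeyError: {err}")
```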

In the second case the row index is not sorted, so pandas raises a KeyError. Again, pandas is not looking for those labels in the column index but in the row index.
