Cleb

Reputation: 25997

Why does .loc behave differently depending on whether values are printed or assigned?

I got confused about the following behavior. When I have a dataframe like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=list('bcdefg'))

which looks as follows:

          A         B         C         D
b -0.907325  0.211740  0.150066 -0.240011
c -0.307543  0.691359 -0.179995 -0.334836
d  1.280978  0.469956 -0.912541  0.487357
e  1.447153 -0.087224 -0.176256  1.319822
f  0.660994 -0.289151  0.956900 -1.063623
g -1.880520  1.099098 -0.759683 -0.657774

I receive the expected error

TypeError: cannot do slice indexing on with these indexers [3] of type 'int'

when I try the following slice using .loc:

print(df.loc[3:, ['C', 'D']])

This is expected, since I pass an integer rather than one of the letters contained in the index.

However, if I now try

df.loc[3:, ['C', 'D']] = 10

it works fine and gives me the output:

          A         B          C          D
b -0.907325  0.211740   0.150066  -0.240011
c -0.307543  0.691359  -0.179995  -0.334836
d  1.280978  0.469956  -0.912541   0.487357
e  1.447153 -0.087224  10.000000  10.000000
f  0.660994 -0.289151  10.000000  10.000000
g -1.880520  1.099098  10.000000  10.000000

My question is why the same indexer fails when the result is printed but works when a value is assigned. Judging from the docstring for .loc, I would have expected the error mentioned above in both cases (see especially the bold part):

Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and **never as an integer position along the index**).
  • A list or array of labels, e.g. ['a', 'b', 'c'].
  • A slice object with labels, e.g. 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!).
  • A boolean array.
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

.loc will raise a KeyError when the items are not found.
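
For illustration, the inclusive label slice described in the docstring can be sketched as follows. The zero-filled frame is a stand-in for the random data above, and the names by_label and by_position are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same labels as above (zeros instead of random values).
df = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'), index=list('bcdefg'))

# Label slice: both endpoints 'd' and 'f' are included.
by_label = df.loc['d':'f', ['C', 'D']]

# Positional slice: the stop position 5 is excluded, so rows 2, 3 and 4.
by_position = df.iloc[2:5, 2:4]

print(list(by_label.index))
print(list(by_position.index))
```

Both selections cover the rows d, e and f, but only the label slice includes its stop value.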

Is there an explanation for this? What am I missing here?

EDIT

In this question similar behavior is considered a bug which was fixed in 0.13. I use 0.19.1.

EDIT 2

Building on @EdChum's post, one can do the following:

df.loc[2] = 20
df.loc[3] = 30
df.loc[4] = 40

which yields

           A          B          C          D
b   0.083326  -1.047032   0.830499  -0.729662
c   0.942744  -0.535013   0.809251   1.132983
d  -0.074918   1.123331  -2.205294  -0.497468
e   0.213349   0.694366  -0.816550   0.496324
f   0.021347   0.917340  -0.595254  -0.392177
g  -1.149890   0.965645   0.172672  -0.043652
2  20.000000  20.000000  20.000000  20.000000
3  30.000000  30.000000  30.000000  30.000000
4  40.000000  40.000000  40.000000  40.000000

However, that is then still confusing to me because while

print(df.loc['d':'f', ['C', 'D']])

works fine, the command

print(df.loc[2:4, ['C', 'D']])

raises the TypeError mentioned above.

Additionally, when one now assigns values like this

df.loc[2:4, ['C', 'D']] = 100

the dataframe looks as follows:

           A          B           C           D
b   0.083326  -1.047032    0.830499   -0.729662
c   0.942744  -0.535013    0.809251    1.132983
d  -0.074918   1.123331  100.000000  100.000000
e   0.213349   0.694366  100.000000  100.000000
f   0.021347   0.917340   -0.595254   -0.392177
g  -1.149890   0.965645    0.172672   -0.043652
2  20.000000  20.000000   20.000000   20.000000
3  30.000000  30.000000   30.000000   30.000000
4  40.000000  40.000000   40.000000   40.000000

So the values are not assigned where one (or at least I) would expect them: the position rather than the label is used.
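
For comparison, a list of labels (as opposed to a slice) is looked up by label, so a sketch along these lines should reach the rows labelled 2, 3 and 4. The frame is a zero-filled stand-in with the same mixed index, not the data shown above.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the mixed index from EDIT 2 (letters followed by integers).
df = pd.DataFrame(
    np.zeros((9, 4)),
    columns=list('ABCD'),
    index=list('bcdefg') + [2, 3, 4],
)

# A list of labels is matched against the index values, not against positions.
df.loc[[2, 3, 4], ['C', 'D']] = 100

print(df)
```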

Upvotes: 9

Views: 1251

Answers (1)

EdChum

Reputation: 394041

I don't think this is a bug but rather undocumented semantics. For instance, setting with enlargement is allowed for the simple case where the row label doesn't exist:

In [22]:
df.loc[3] = 10
df

Out[22]:
           A          B          C          D
b  -0.907325   0.211740   0.150066  -0.240011
c  -0.307543   0.691359  -0.179995  -0.334836
d   1.280978   0.469956  -0.912541   0.487357
e   1.447153  -0.087224  -0.176256   1.319822
f   0.660994  -0.289151   0.956900  -1.063623
g  -1.880520   1.099098  -0.759683  -0.657774
3  10.000000  10.000000  10.000000  10.000000
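
A minimal, self-contained sketch of the same enlargement, assuming a zero-filled stand-in frame and an explicit object-dtype index (both are assumptions made so the example is reproducible):

```python
import numpy as np
import pandas as pd

# Assumption: an explicit object-dtype index, mirroring what older pandas
# versions produce for a list of letters.
df = pd.DataFrame(
    np.zeros((6, 4)),
    columns=list('ABCD'),
    index=pd.Index(list('bcdefg'), dtype=object),
)

rows_before = len(df)
df.loc[3] = 10          # the label 3 does not exist, so a new row is appended
rows_after = len(df)

print(rows_before, rows_after)
print(df.index[-1])
```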

If we pass a slice, the labels in the slice aren't found, but because it is an integer slice it gets converted to an ordinal (positional) slice:

In [24]:
df.loc[3:5] = 9
df

Out[24]:
           A          B          C          D
b  -0.907325   0.211740   0.150066  -0.240011
c  -0.307543   0.691359  -0.179995  -0.334836
d   1.280978   0.469956  -0.912541   0.487357
e   9.000000   9.000000   9.000000   9.000000
f   9.000000   9.000000   9.000000   9.000000
g  -1.880520   1.099098  -0.759683  -0.657774
3  10.000000  10.000000  10.000000  10.000000

The post you linked, and the bug it describes, concerned selection without assignment, where passing a non-existent label should raise a KeyError. That is a different situation from the one here.

If we look at __setitem__:

def __setitem__(self, key, value):
    key = com._apply_if_callable(key, self)

    # see if we can slice the rows
    indexer = convert_to_index_sliceable(self, key)

Here it will try to convert the slice by calling convert_to_index_sliceable:

def convert_to_index_sliceable(obj, key):
    """if we are index sliceable, then return my slicer, otherwise return None
    """
    idx = obj.index
    if isinstance(key, slice):
        return idx._convert_slice_indexer(key, kind='getitem')

If we look at the docstrings for this:

Signature: df.index._convert_slice_indexer(key, kind=None)
Docstring:
convert a slice indexer. disallow floats in the start/stop/step

Parameters
----------
key : label of the slice bound
kind : {'ix', 'loc', 'getitem', 'iloc'} or None

and then run this:

In [29]:
df.index._convert_slice_indexer(slice(3,5),'loc')

Out[29]:
slice(3, 5, None)

This is then used to slice the index:

In [28]:
df.index[df.index._convert_slice_indexer(slice(3,5),'loc')]

Out[28]:
Index(['e', 'f'], dtype='object')
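
Since _convert_slice_indexer is a private method whose signature may differ between pandas versions, the end result can also be reproduced with public API only. This is a sketch of the last step, not of the internal code path; the variable names are illustrative.

```python
import pandas as pd

index = pd.Index(list('bcdefg'))

# The integer slice 3:5 ends up being applied as an ordinary positional slice.
positional = slice(3, 5)
labels = index[positional]

print(list(labels))
```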

So we see that even though you passed what appeared to be non-existent labels, the integer slice object was converted into an ordinal slice that was compatible with the df according to different rules.
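
To avoid depending on this conversion at all, the intent can be spelled out explicitly. The following is a sketch under the assumption that the goal is "rows from the fourth position onward, columns C and D"; the frame is a zero-filled stand-in and col_pos and start_label are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in frame: zeros instead of random values, so the result is easy to check.
df = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'), index=list('bcdefg'))

# Option 1: purely positional. Column positions are looked up from their labels.
col_pos = df.columns.get_indexer(['C', 'D'])
df.iloc[3:, col_pos] = 10

# Option 2: purely label-based. Translate the row position into a label first.
start_label = df.index[3]
df.loc[start_label:, ['A']] = -1

print(df)
```

Either form states explicitly whether positions or labels are meant, so it does not rely on how an integer slice passed to .loc is interpreted.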

Upvotes: 3
