Reputation: 1072

Inconsistent behavior when inserting a set into cells using .loc in pandas

Here is a pretty simple example:

import pandas
df = pandas.DataFrame()
value_to_be_set = {'1'}

df.loc[0, 'col1'] = value_to_be_set
df['col2'] = None
df.loc[0, 'col2'] = value_to_be_set

print(df.head())

Output:

   col1 col2
0    1  {1}

Why is the datatype different for both columns?

Python 3.7.3
pandas version: 0.23.4

Upvotes: 8

Answers (3)

Dark debo

Reputation: 69

import pandas
df = pandas.DataFrame()
value_to_be_set = {'1'}

df.loc[0, 'col1'] = value_to_be_set
df['col2'] = None
df.loc[0, 'col2'] = value_to_be_set

print(df.head())

In the first case, col1 does not exist yet, so pandas treats the value as an iterable: it iterates over the set, finds the single element 1, and stores that element in the cell.

In the second case, col2 already exists because it was initialized with None, so the whole set is taken as a single element and the cell holds {1}.

import pandas
df = pandas.DataFrame()
value_to_be_set = {'1'}

df.loc[0, 'col1'] = value_to_be_set
# initialization of col2 commented out
# df['col2'] = None
df.loc[0, 'col2'] = value_to_be_set

print(df.head())

After commenting out the None initialization, col2 ends up with the same value as col1:

  col1 col2
0    1    1

Upvotes: 1

r.ook

Reputation: 13878

When value_to_be_set contains more than one element, the first assignment raises this error:

Traceback (most recent call last):
  File "<pyshell#314>", line 1, in <module>
    df.loc[0, 'col1'] = value_to_be_set
  File "C:\Users\rook\Projects\Sandbox\env\lib\site-packages\pandas\core\indexing.py", line 671, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "C:\Users\rook\Projects\Sandbox\env\lib\site-packages\pandas\core\indexing.py", line 850, in _setitem_with_indexer
    self._setitem_with_indexer(new_indexer, value)
  File "C:\Users\rook\Projects\Sandbox\env\lib\site-packages\pandas\core\indexing.py", line 1019, in _setitem_with_indexer
    "Must have equal len keys and value "
ValueError: Must have equal len keys and value when setting with an iterable

By contrast, the same assignment to 'col2' after initializing the column does not raise.

The source of __setitem__ in my environment (pandas 1.0.3) looks like this:

def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        key = com.apply_if_callable(key, self.obj)
    indexer = self._get_setitem_indexer(key)
    self._setitem_with_indexer(indexer, value)

On the current branch on GitHub (1.0.4), the last line has been changed to the following:

def __setitem__(self, key, value):
    # ... same as above ... #
    self._has_valid_setitem_indexer(key)

    iloc = self if self.name == "iloc" else self.obj.iloc
    iloc._setitem_with_indexer(indexer, value)

However, _has_valid_setitem_indexer still seems to be a work in progress:

def _has_valid_setitem_indexer(self, indexer) -> bool:
    """
    Validate that a positional indexer cannot enlarge its target
    will raise if needed, does not modify the indexer externally.
    Returns
    -------
    bool
    """
    if isinstance(indexer, dict):
        raise IndexError("iloc cannot enlarge its target object")
    else:
        if not isinstance(indexer, tuple):
            indexer = _tuplify(self.ndim, indexer)
        for ax, i in zip(self.obj.axes, indexer):
            if isinstance(i, slice):
                # should check the stop slice?
                pass
            elif is_list_like_indexer(i):
                # should check the elements?
                pass
            elif is_integer(i):
                if i >= len(ax):
                    raise IndexError("iloc cannot enlarge its target object")
            elif isinstance(i, dict):
                raise IndexError("iloc cannot enlarge its target object")

    return True

In any case, I would suggest submitting this as a bug, since it is still reproducible in the latest version, 1.0.4:

>>> df.loc[0, 'col1'] = v2
>>> df['col2'] = None
>>> df.loc[0, 'col2'] = v2
>>> df
  col1 col2
0    1  {1}
>>> pd.__version__
'1.0.4'

The absurdity becomes apparent if you insert the same item at a second index:

>>> df = pd.DataFrame()
>>> df.loc[0, 'col1'] = v
>>> df.loc[1, 'col1'] = v
>>> df
  col1
0    1
1  {1}

I would say that using loc to create new columns is indeed buggy, because of the implied unpacking.
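One way to sidestep the unpacking might be to make sure the column exists before the set is written, or to hand the set over as an element of a list when building the frame. This is only a minimal sketch, assuming an object-dtype column is acceptable for your data; the names df_ctor and df_at are illustrative and not from the question:

```python
import pandas as pd

value_to_be_set = {'1'}

# Option 1: pass the set as an element of a list at construction time,
# so it is one cell value rather than the iterable being assigned.
df_ctor = pd.DataFrame({'col1': [value_to_be_set]})

# Option 2: create the object-dtype column first, then write one cell
# with .at, which addresses a single existing cell by label.
df_at = pd.DataFrame({'col1': pd.Series([None], dtype=object)})
df_at.at[0, 'col1'] = value_to_be_set

# Both cells should now hold the set object itself.
print(type(df_ctor.at[0, 'col1']), type(df_at.at[0, 'col1']))
```

Neither option relies on loc enlarging the frame, which is the code path where the unpacking shows up.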

Upvotes: 3

Serge Ballesta

Reputation: 148910

In the first assignment, you create a new column from a set, in other words from an iterable. You ask for one single element and provide an iterable of size one, so pandas assigns the content of the set to the single cell. You can try a set of 2 values to see that it raises an error.

In the second assignment, you update a cell in an existing column. Pandas has no reason to unpack anything here, so it assigns the set itself to the cell.

To be honest, this explains what happens, but it does not justify the difference in behaviour...
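The two code paths can be pictured with a toy model. This is plain Python and not pandas code: assign_cells and its unpack flag are hypothetical names made up for illustration, and the real pandas internals are more involved.

```python
def assign_cells(n_cells, value, unpack):
    """Toy model (not pandas code) of writing `value` into `n_cells` cells."""
    if unpack and isinstance(value, (list, tuple, set, frozenset)):
        items = list(value)
        if len(items) != n_cells:
            # Same message as in the pandas traceback, reused for illustration.
            raise ValueError(
                "Must have equal len keys and value when setting with an iterable"
            )
        return items          # each element lands in its own cell
    return [value] * n_cells  # the object itself is the cell value


# New column: the one-element set is unpacked, the cell gets '1'.
new_column = assign_cells(1, {'1'}, unpack=True)

# Existing column: nothing is unpacked, the cell gets the set {'1'}.
existing_column = assign_cells(1, {'1'}, unpack=False)

print(new_column, existing_column)  # ['1'] [{'1'}]
```

With a two-element set and unpack=True, the toy function raises the ValueError, which mirrors the error you get from the first assignment.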

Upvotes: 7
