rhug123

Reputation: 8768

Why does .loc not always match column names?

I noticed this today and wanted to ask because I am a little confused about this.

Let's say we have two DataFrames:

df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
    A   B   C
0   3   1   6
1   2   4   0
2   8   8   0
3   8   6   7
4   4   5   0

df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))

    C   B   A
0   3   5   5
1   7   4   6
2   0   7   7
3   6   6   5
4   4   0   6

If we wanted to conditionally overwrite rows of the first df with values from df2, we could do this:

df.loc[df['A'].gt(3)] = df2

I would expect the columns to be aligned and, if any columns were missing, the corresponding values in the first df to be filled with NaN. However, when the above code is run, it replaces the data without taking the column names into account. (It does take the index labels into account, however.)

    A   B   C
0   3   1   6
1   2   4   0
2   0   7   7
3   6   6   5
4   4   0   6

At index 2, instead of [7,7,0] we have [0,7,7].

However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.

df.loc[df['A'].gt(3),['A','B','C']] = df2
    A   B   C
0   3   1   6
1   2   4   0
2   7   7   0
3   5   6   6
4   6   0   4

Why does this happen?

Upvotes: 1

Views: 583

Answers (1)

Henry Ecker

Reputation: 35646

Interestingly, loc performs a number of optimizations to improve performance. One of them is checking the type of the indexer passed in.

Both Row and Column Indexes Included

When both a row index and a column index are passed, the __setitem__ function interprets the key as a tuple:

def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        key = com.apply_if_callable(key, self.obj)
    indexer = self._get_setitem_indexer(key)
    self._has_valid_setitem_indexer(key)

    iloc = self if self.name == "iloc" else self.obj.iloc
    iloc._setitem_with_indexer(indexer, value, self.name)

key:

(0    False
1    False
2     True
3     True
4     True
Name: A, dtype: bool, 
['A', 'B', 'C'])

The key is then passed to _get_setitem_indexer, which converts it from a label-based indexer to a positional one:

indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
    """
    Convert a potentially-label-based key into a positional indexer.
    """
    if self.name == "loc":
        self._ensure_listlike_indexer(key)

    if self.axis is not None:
        return self._convert_tuple(key, is_setter=True)

    ax = self.obj._get_axis(0)

    if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
        with suppress(TypeError, KeyError, InvalidIndexError):
            # TypeError e.g. passed a bool
            return ax.get_loc(key)

    if isinstance(key, tuple):
        with suppress(IndexingError):
            return self._convert_tuple(key, is_setter=True)

    if isinstance(key, range):
        return list(key)

    try:
        return self._convert_to_indexer(key, axis=0, is_setter=True)
    except TypeError as e:

        # invalid indexer type vs 'other' indexing errors
        if "cannot do" in str(e):
            raise
        elif "unhashable type" in str(e):
            raise
        raise IndexingError(key) from e

This generates a tuple indexer (both rows and columns are converted):

if isinstance(key, tuple):
    with suppress(IndexingError):
        return self._convert_tuple(key, is_setter=True)

returns

(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
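The same two arrays can be reproduced with public API calls. This is only a rough sketch of the label-to-position conversion, not the code pandas runs internally. The data is copied from the question, and the names row_positions and col_positions are illustrative.

```python
import numpy as np
import pandas as pd

# First DataFrame from the question.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)

mask = df["A"].gt(3)

# Rows: integer positions where the boolean mask is True.
row_positions = np.flatnonzero(mask.to_numpy())

# Columns: integer positions of the requested column labels.
col_positions = df.columns.get_indexer(["A", "B", "C"])

print(row_positions)  # [2 3 4]
print(col_positions)  # [0 1 2]
```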

Only Row Index Included

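However, when only a row index is passed to loc, the key is not a tuple, so only a single dimension is converted from label-based to positional. A boolean Series is not a range either, so the key falls through to the last branch of _get_setitem_indexer:

return self._convert_to_indexer(key, axis=0, is_setter=True)

returns

[2 3 4]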

For this reason, no alignment happens among columns when only a row indexer is passed to loc: the resulting indexer carries no column information, so the values of df2 are placed by position.


That is why an empty slice is often added for the columns:

df.loc[df['A'].gt(3), :] = df2

This is sufficient to align the columns appropriately.
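As a quick check against the question's own data, the sketch below applies both spellings (the empty slice and the explicit column list) to copies of df and prints whether they agree. The values are typed in from the question, and with_slice and with_labels are illustrative names.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)
df2 = pd.DataFrame(
    [[3, 5, 5], [7, 4, 6], [0, 7, 7], [6, 6, 5], [4, 0, 6]],
    columns=list("CBA"),
)

mask = df["A"].gt(3)

# Row indexer plus an empty slice for the columns.
with_slice = df.copy()
with_slice.loc[mask, :] = df2

# Row indexer plus an explicit list of column labels.
with_labels = df.copy()
with_labels.loc[mask, ["A", "B", "C"]] = df2

# Compare the values of the two results and show one of them.
same = with_slice.to_numpy().tolist() == with_labels.to_numpy().tolist()
print(same)
print(with_slice)
```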

Example:

import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)

df.loc[df['A'].gt(3), :] = df2
print(df)

df:

   A  B  C
0  3  6  6
1  0  8  4
2  7  0  0
3  7  1  5
4  7  0  1

df2:

   C  B  A
0  4  6  2
1  1  2  7
2  0  5  0
3  0  4  4
4  3  2  4

df.loc[df['A'].gt(3), :] = df2:

   A  B  C
0  3  6  6
1  0  8  4
2  0  5  0
3  4  4  0  # Aligned as expected
4  4  2  3
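If the row-only form of loc is preferred, one defensive option is to put df2's columns into df's order before assigning. This is a suggestion rather than something covered above, and the sketch assumes df2 contains every column of df and shares its index. With the columns already in matching order, positional placement and label-based placement give the same result.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)
df2 = pd.DataFrame(
    [[3, 5, 5], [7, 4, 6], [0, 7, 7], [6, 6, 5], [4, 0, 6]],
    columns=list("CBA"),
)

mask = df["A"].gt(3)

# Select the matching rows of df2 and reorder its columns to df's order,
# so the assigned block lines up with the target by label and by position.
df.loc[mask] = df2.loc[mask, df.columns]
print(df)
```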

Upvotes: 2
