Reputation: 1435
I'm confused about why square brackets [] and .loc behave differently when creating multiple columns. I've looked at similar questions but couldn't find an answer.
For example,
>>> dates = pd.date_range('1/1/2000', periods=8)
>>> df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
2000-01-01 -1.011264 -1.751948 0.059012 0.514253
2000-01-02 0.596959 0.348866 -1.011628 0.950259
2000-01-03 0.305281 0.486400 -1.034534 -1.523402
2000-01-04 -0.880457 0.379837 2.023866 1.588379
2000-01-05 -1.142070 -0.168992 -0.391355 0.809820
2000-01-06 -0.335015 0.721563 -0.665120 -1.097811
2000-01-07 -0.160611 -0.601393 -0.257349 -0.830527
2000-01-08 0.197624 -0.082786 1.335873 -0.841006
If I create multiple columns with brackets, it works as below.
>>> df[['E','F']] = df[['A','B']]
>>> df
A B C D E F
2000-01-01 -1.011264 -1.751948 0.059012 0.514253 -1.011264 -1.751948
2000-01-02 0.596959 0.348866 -1.011628 0.950259 0.596959 0.348866
2000-01-03 0.305281 0.486400 -1.034534 -1.523402 0.305281 0.486400
2000-01-04 -0.880457 0.379837 2.023866 1.588379 -0.880457 0.379837
2000-01-05 -1.142070 -0.168992 -0.391355 0.809820 -1.142070 -0.168992
2000-01-06 -0.335015 0.721563 -0.665120 -1.097811 -0.335015 0.721563
2000-01-07 -0.160611 -0.601393 -0.257349 -0.830527 -0.160611 -0.601393
2000-01-08 0.197624 -0.082786 1.335873 -0.841006 0.197624 -0.082786
However, if I use .loc to create multiple columns, it fails with a KeyError.
>>> df.loc[:,['H','I']] = df[['A','B']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 189, in __setitem__
indexer = self._get_setitem_indexer(key)
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 167, in _get_setitem_indexer
return self._convert_tuple(key, is_setter=True)
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 248, in _convert_tuple
idx = self._convert_to_indexer(k, axis=i, is_setter=is_setter)
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 1354, in _convert_to_indexer
return self._get_listlike_indexer(obj, axis, **kwargs)[1]
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 1161, in _get_listlike_indexer
raise_missing=raise_missing)
File "C:\Python\lib\site-packages\pandas\core\indexing.py", line 1246, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)))
KeyError: "None of [Index(['H', 'I'], dtype='object')] are in the [columns]"
.loc works fine when creating only one column (square brackets work as well).
>>> df.loc[:,'G'] = df['A']
>>> df
A B C D E F G
2000-01-01 -1.011264 -1.751948 0.059012 0.514253 -1.011264 -1.751948 -1.011264
2000-01-02 0.596959 0.348866 -1.011628 0.950259 0.596959 0.348866 0.596959
2000-01-03 0.305281 0.486400 -1.034534 -1.523402 0.305281 0.486400 0.305281
2000-01-04 -0.880457 0.379837 2.023866 1.588379 -0.880457 0.379837 -0.880457
2000-01-05 -1.142070 -0.168992 -0.391355 0.809820 -1.142070 -0.168992 -1.142070
2000-01-06 -0.335015 0.721563 -0.665120 -1.097811 -0.335015 0.721563 -0.335015
2000-01-07 -0.160611 -0.601393 -0.257349 -0.830527 -0.160611 -0.601393 -0.160611
2000-01-08 0.197624 -0.082786 1.335873 -0.841006 0.197624 -0.082786 0.197624
I'm confused about why .loc does not behave like square brackets when creating multiple columns. I prefer the explicit style of .loc[], and it bothers me that its functionality is sometimes limited. Am I missing something? Why do they work differently in this case?
* Addition to the original question *
.loc[] also produces NaN columns when assigning to existing columns. For example,
>>> df[['E','F']] = df[['A','B']]
>>> df
A B ... E F
2000-01-01 0.934380 -0.321112 ... 0.934380 -0.321112
2000-01-02 -0.760045 0.646212 ... -0.760045 0.646212
2000-01-03 0.645231 -0.910008 ... 0.645231 -0.910008
2000-01-04 -1.117132 2.595804 ... -1.117132 2.595804
2000-01-05 -1.273579 0.291202 ... -1.273579 0.291202
2000-01-06 0.142610 -0.368157 ... 0.142610 -0.368157
2000-01-07 0.567490 -1.598343 ... 0.567490 -1.598343
2000-01-08 1.300694 0.498405 ... 1.300694 0.498405
I created the new columns E and F, then tried to assign them new values from C and D using .loc[].
>>> df.loc[:,['E','F']] = df[['C','D']]
>>> df
A B C D E F
2000-01-01 0.934380 -0.321112 0.747195 -0.991180 NaN NaN
2000-01-02 -0.760045 0.646212 -0.121421 2.262384 NaN NaN
2000-01-03 0.645231 -0.910008 0.170989 -1.552823 NaN NaN
2000-01-04 -1.117132 2.595804 0.569809 1.575253 NaN NaN
2000-01-05 -1.273579 0.291202 0.688443 -0.581674 NaN NaN
2000-01-06 0.142610 -0.368157 -0.674774 -1.961087 NaN NaN
2000-01-07 0.567490 -1.598343 -1.346179 -1.139205 NaN NaN
2000-01-08 1.300694 0.498405 -0.358015 -1.637471 NaN NaN
It seems that using .loc[] still causes a problem.
Upvotes: 0
Views: 521
Reputation: 14103
As previously said, this is done intentionally. Here are a few examples.
It looks like it has to do with __getitem__, which is called when using [].
Let's look at a few errors: df['H'] returns a similar error to df.loc[:,'H']. Both seem to use pandas\core\frame.py __getitem__, which is why they behave the same when setting:
df['H'] = df['A']
df.loc[:, 'H'] = df['A']
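As a minimal sketch of the single-label case (the small frame and its values are made up for illustration and are not the asker's data), both forms create the new column:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are arbitrary.
dates = pd.date_range("2000-01-01", periods=4)
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=dates, columns=["A", "B"])

# A single missing label is allowed when setting: both forms add a column.
df["G"] = df["A"]
df.loc[:, "H"] = df["A"]

# Compare the two new columns element by element.
same = bool((df["G"] == df["H"]).all())
print(same)
```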
When you use loc with a list (df.loc[:, ['H', 'I']] or df.loc[:, ['H']]), it no longer uses pandas\core\frame.py __getitem__. It uses pandas\core\indexing.py __getitem__, which sets raise_missing to False in _validate_read_indexer.
There is a comment in this function that provides some information:
# We (temporarily) allow for some missing keys with .loc, except in
# some cases (e.g. setting) in which "raise_missing" will be False
df[['H','I']] uses pandas\core\frame.py __getitem__, which is why there is no error when setting.
This is just my guess as to what is going on.
Your other question, about df.loc[:,['E','F']] = df[['C','D']], is explained in the docs under "The correct way to swap column values is by using raw values". You should use to_numpy(): df.loc[:,['E','F']] = df[['C','D']].to_numpy()
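A small sketch of that difference, using a made-up frame rather than the asker's random data. The `aligned` copy is only there to keep the two assignments separate; it shows the label-aligned assignment from the question next to the raw-values one:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are arbitrary.
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=["A", "B", "C", "D"])
df[["E", "F"]] = df[["A", "B"]]

# A DataFrame on the right-hand side is aligned on column labels first.
# 'C'/'D' do not match 'E'/'F', which is the situation in the question.
aligned = df.copy()
aligned.loc[:, ["E", "F"]] = aligned[["C", "D"]]

# Raw values carry no labels, so they are assigned by position.
df.loc[:, ["E", "F"]] = df[["C", "D"]].to_numpy()

print(aligned[["E", "F"]])
print(df[["E", "F"]])
```

Assigning the raw array skips alignment entirely, so the shape of the right-hand side has to match the selection.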
Upvotes: 1
Reputation: 153460
This is intended behavior, put into pandas after 0.21.0. See docs here.
The root of your error message is this part, where either 'H' or 'I' is missing in the dataframe:
df.loc[:,['H','I']]
Using a list with .loc that contains missing labels will raise a KeyError.
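A hedged sketch of the lookup that fails, on a made-up frame. The `reindex` call is one alternative that tolerates missing labels; it is shown as an option, not as the only way to do this:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are arbitrary.
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])

# Selecting a list of labels that are all missing raises KeyError.
try:
    df.loc[:, ["H", "I"]]
    raised = False
except KeyError:
    raised = True

# reindex accepts missing labels and fills the new columns with NaN.
padded = df.reindex(columns=["A", "B", "H", "I"])

print(raised)
print(padded)
```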
Upvotes: 0