Reputation: 77454
Here is an example from an IPython session in which some seemingly straightforward indexing assignments to a Pandas DataFrame work and others do not:
In [652]: dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])
In [653]: dfrm
Out[653]:
A B C
0 0.777147 0.558404 0.424222
1 0.906354 0.111197 0.492625
2 0.011354 0.468661 0.056303
3 0.118818 0.117526 0.649210
4 0.746045 0.583369 0.962173
5 0.374871 0.285712 0.868599
6 0.223596 0.963223 0.012154
7 0.969879 0.043160 0.891143
8 0.527701 0.992965 0.073797
9 0.553854 0.969303 0.523098
In [654]: dfrm['A'][dfrm.A > 0.5] = [1,2,3,4,5,6]
In [655]: dfrm
Out[655]:
A B C
0 1.000000 0.558404 0.424222
1 2.000000 0.111197 0.492625
2 0.011354 0.468661 0.056303
3 0.118818 0.117526 0.649210
4 3.000000 0.583369 0.962173
5 0.374871 0.285712 0.868599
6 0.223596 0.963223 0.012154
7 4.000000 0.043160 0.891143
8 5.000000 0.992965 0.073797
9 6.000000 0.969303 0.523098
In [656]: dfrm[['B','C']][dfrm.A > 0.5] = 100*np.random.rand(6,2)
In [657]: dfrm
Out[657]:
A B C
0 1.000000 0.558404 0.424222
1 2.000000 0.111197 0.492625
2 0.011354 0.468661 0.056303
3 0.118818 0.117526 0.649210
4 3.000000 0.583369 0.962173
5 0.374871 0.285712 0.868599
6 0.223596 0.963223 0.012154
7 4.000000 0.043160 0.891143
8 5.000000 0.992965 0.073797
9 6.000000 0.969303 0.523098
In [658]: dfrm[dfrm.A > 0.5] = 100*np.random.rand(6,3)
In [659]: dfrm
Out[659]:
A B C
0 27.738118 18.812116 46.369840
1 35.335223 58.365611 7.773464
2 0.011354 0.468661 0.056303
3 0.118818 0.117526 0.649210
4 97.439481 98.621074 69.816171
5 0.374871 0.285712 0.868599
6 0.223596 0.963223 0.012154
7 53.609637 30.952762 81.379502
8 68.473117 16.261694 91.092718
9 82.253724 94.979991 72.571951
In [660]: dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-660-35fb8e212806> in <module>()
----> 1 dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
1707 self._boolean_set(key, value)
1708 elif isinstance(key, (np.ndarray, list)):
-> 1709 return self._set_item_multiple(key, value)
1710 else:
1711 # set column
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item_multiple(self, keys, value)
1728 def _set_item_multiple(self, keys, value):
1729 if isinstance(value, DataFrame):
-> 1730 assert(len(value.columns) == len(keys))
1731 for k1, k2 in zip(keys, value.columns):
1732 self[k1] = value[k2]
AssertionError:
Can anyone explain why some (but not all) of these work, and why the final one actually raises an error?
Update:
We have Pandas 0.11 installed, but it's not the default version for development, so it's only a sandbox sort of thing for me right now. Even when I repeat this example in 0.11, I see the same assignment problems, except that the last example now works correctly with no error. The muddled conventions for how to invoke the original DataFrame's __setitem__ are still there:
Python 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 11 2012, 17:52:16)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "credits", "demo" or "enthought" for more information.
Hello
>>> import pandas
>>> pandas.__version__
'0.11.0'
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'np' is not defined
>>> import numpy as np
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['A', 'B', 'C'])
>>> dfrm
A B C
0 0.745516 0.062613 0.147684
1 0.369141 0.447022 0.114963
2 0.820178 0.946806 0.687971
3 0.771971 0.934799 0.633633
4 0.828249 0.065587 0.848788
5 0.433796 0.740885 0.160140
6 0.663891 0.753134 0.849269
7 0.647054 0.962267 0.453865
8 0.345706 0.030634 0.058697
9 0.994135 0.990536 0.436903
>>> dfrm[dfrm.A > 0.5]
A B C
0 0.745516 0.062613 0.147684
2 0.820178 0.946806 0.687971
3 0.771971 0.934799 0.633633
4 0.828249 0.065587 0.848788
6 0.663891 0.753134 0.849269
7 0.647054 0.962267 0.453865
9 0.994135 0.990536 0.436903
>>> len(dfrm[dfrm.A > 0.5])
7
>>> dfrm['A'][dfrm.A > 0.5] = [1,2,3,4,5,6,7]
>>> dfrm
A B C
0 1.000000 0.062613 0.147684
1 0.369141 0.447022 0.114963
2 2.000000 0.946806 0.687971
3 3.000000 0.934799 0.633633
4 4.000000 0.065587 0.848788
5 0.433796 0.740885 0.160140
6 5.000000 0.753134 0.849269
7 6.000000 0.962267 0.453865
8 0.345706 0.030634 0.058697
9 7.000000 0.990536 0.436903
>>> dfrm[['B','C']][dfrm.A > 0.5] = 100*np.random.rand(7,2)
>>> dfrm
A B C
0 1.000000 0.062613 0.147684
1 0.369141 0.447022 0.114963
2 2.000000 0.946806 0.687971
3 3.000000 0.934799 0.633633
4 4.000000 0.065587 0.848788
5 0.433796 0.740885 0.160140
6 5.000000 0.753134 0.849269
7 6.000000 0.962267 0.453865
8 0.345706 0.030634 0.058697
9 7.000000 0.990536 0.436903
>>> dfrm[dfrm.A > 0.5] = 0.5*dfrm[dfrm.A > 0.5]
>>> dfrm
A B C
0 0.500000 0.031306 0.073842
1 0.369141 0.447022 0.114963
2 1.000000 0.473403 0.343985
3 1.500000 0.467400 0.316816
4 2.000000 0.032794 0.424394
5 0.433796 0.740885 0.160140
6 2.500000 0.376567 0.424635
7 3.000000 0.481133 0.226933
8 0.345706 0.030634 0.058697
9 3.500000 0.495268 0.218452
>>>
Second Update:
Here's another super unexpected behavior:
In [681]: id(dfrm.A)
Out[681]: 298480536
In [682]: id(dfrm.A)
Out[682]: 298480536
In [683]: id(dfrm.A)
Out[683]: 298480536
In [684]: id(dfrm['A'])
Out[684]: 298480536
In [685]: id(dfrm['A'])
Out[685]: 298480536
In [686]: id(dfrm['A'])
Out[686]: 298480536
In [687]: id(dfrm[['A']])
Out[687]: 281536912
In [688]: id(dfrm[['A']])
Out[688]: 281535824
In [689]: id(dfrm[['A']])
Out[689]: 281536336
Upvotes: 1
Views: 150
Reputation: 375595
Assigning through two or more chained getitems/slices may or may not work depending on the situation, so you should avoid doing it! Rewrite each assignment so that it happens in one pass.
There was a substantial amount of work in 0.11 (possibly before) to clear up this behaviour. Now pandas overloads these assignments so that it does not matter whether the target is a view or a copy, provided you do the assignment in one pass, which in general you should be doing anyway.
For example:
dfrm.loc[dfrm.A > 0.5, 'A'] = [1, 2, 3, 4, 5, 6]
dfrm.loc[dfrm.A > 0.5, ['B','C']] = 100 * np.random.rand(6, 2)
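To make the difference concrete, here is a minimal sketch on a small made-up frame (the data and the helper names before and chained_changed_nothing are just for illustration, and it assumes a pandas version that has loc). It tries the chained form from the question first, then the one-pass form:

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical fixed data so the outcome is reproducible.
dfrm = pd.DataFrame({"A": [0.9, 0.1, 0.7, 0.2],
                     "B": [1.0, 2.0, 3.0, 4.0],
                     "C": [5.0, 6.0, 7.0, 8.0]})
before = dfrm.copy()

# Chained: dfrm[["B", "C"]] builds a new frame, and the assignment lands
# on that temporary object. pandas may warn here, so warnings are silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    dfrm[["B", "C"]][dfrm.A > 0.5] = 100.0
chained_changed_nothing = dfrm.equals(before)

# One pass: the row mask and the column list go into a single .loc call,
# so the write goes to dfrm itself.
dfrm.loc[dfrm.A > 0.5, ["B", "C"]] = np.array([[10.0, 20.0], [30.0, 40.0]])
dfrm.loc[dfrm.A > 0.5, "A"] = [1.0, 2.0]

print(chained_changed_nothing)
print(dfrm)
```

The right-hand side has to match the number of selected rows (two here, six or seven in the question's sessions).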
It is also generally good practice to state explicitly that you are indexing by label (with loc):
dfrm.loc[dfrm.A > 0.5] = 100 * np.random.rand(6, 3)
You could also consider rewriting:
dfrm.loc[dfrm.A > 0.5] = 0.5 * dfrm.loc[dfrm.A > 0.5]
to
dfrm.loc[dfrm.A > 0.5] *= 0.5
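As a quick sanity check that the two spellings agree, here is a sketch on a small made-up frame (the names a and b are illustrative only):

```python
import pandas as pd

# Two identical hypothetical frames, one per spelling.
a = pd.DataFrame({"A": [0.9, 0.1, 0.7], "B": [2.0, 4.0, 6.0]})
b = a.copy()

# Explicit form: read the selected rows, scale them, write them back.
a.loc[a.A > 0.5] = 0.5 * a.loc[a.A > 0.5]

# Augmented form: the same read-scale-write, spelled with *=.
b.loc[b.A > 0.5] *= 0.5

print(a.equals(b))
print(b)
```

In both cases only the rows where A exceeded 0.5 are halved; the other rows are left alone.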
The AssertionError is a surprising error in 0.8.1 (it seems to be fixed in later versions). If the above doesn't work, a possible workaround is to set the fancy index first (df_A_gt_half = dfrm.A > 0.5) and then do the assignment using that, in which case you are forced to use ix rather than loc.
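A sketch of that workaround on made-up data follows. It is written with loc so that it runs on current pandas; on 0.8.1 the same lines would presumably use ix in its place, which I have not verified against that version:

```python
import pandas as pd

# Hypothetical fixed data.
dfrm = pd.DataFrame({"A": [0.9, 0.1, 0.7], "B": [1.0, 2.0, 3.0]})

# Compute the boolean mask once, before anything is modified.
df_A_gt_half = dfrm.A > 0.5

# Reuse the stored mask for every assignment. The second line still hits
# the original rows 0 and 2, even though A no longer exceeds 0.5 there.
dfrm.loc[df_A_gt_half, "A"] = 0.0
dfrm.loc[df_A_gt_half, "B"] = -1.0

print(dfrm)
```

Storing the mask also avoids a subtle trap: once A has been overwritten, recomputing dfrm.A > 0.5 would select different rows.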
Upvotes: 1