statsNoob

Reputation: 1355

Replacing pandas DataFrame variable values with a NumPy array

I am doing a transformation on a variable from a pandas dataframe and then I would like to replace the column with my new values. The problem seems to be that after the transformation, the length of the array is not the same as the length of my dataframe's index. I don't think that is true though.

>>> df['variable'] = stats.boxcox(df.variable)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2119, in __setitem__
    self._set_item(key, value)
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2165, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2205, in _sanitize_column
    raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index

When I check, the lengths really do disagree: len() of the stats.boxcox result is 2, but when I print that result it reports Length: 50000. What is going on here?

>>> len(df)
50000
>>> len(stats.boxcox(df.variable))
2
>>> stats.boxcox(df.variable)
(0    -0.079496
1    -0.117982
2    -0.104637

...
49985    -0.041300
49986     0.651771
49987    -0.115660
49988    -0.118034
49998    -0.118014
49999    -0.034076
Name: feat9, Length: 50000, dtype: float64, 8.4721358117221772)
>>> 

Upvotes: 3

Views: 4602

Answers (1)

BrenBarn

Reputation: 251365

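You can see in your example that the result of boxcox is a tuple: the printed output opens with a parenthesis and ends with ", 8.4721358117221772)", so the 50000-element Series is only the first of two elements. This is consistent with the documentation, which indicates that boxcox returns a tuple of the transformed data and a lambda value. Notice in the example on that page that it does: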

xt, _ = stats.boxcox(x)

. . . showing again that boxcox returns a 2-tuple.

You should be doing df['variable'] = stats.boxcox(df.variable)[0].
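Here is a minimal sketch of the same fix using tuple unpacking instead of indexing. The DataFrame is made-up illustration data (lognormal, so every value is positive, which boxcox requires), and the names values, transformed and lmbda are just placeholders, not anything from your code.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the asker's data: 1000 strictly positive values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"variable": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

values = df["variable"].to_numpy()

# With no lmbda argument, boxcox returns a 2-tuple:
# (transformed data, fitted lambda). That is why len() of the result is 2.
result = stats.boxcox(values)
transformed, lmbda = result

# Assign only the transformed data, which has one value per row.
df["variable"] = transformed

# If lmbda is passed explicitly, boxcox returns just the transformed array.
refit = stats.boxcox(values, lmbda=lmbda)
```

Unpacking also keeps the fitted lambda around, which you will need if you later want to invert the transform or apply the same one to new data.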

Upvotes: 11
