James Madison

Reputation: 943

Pandas data frame behavior

This code works--it fills each column's NaN values with that column's mean:

import numpy as np
import pandas as pd

def setSerNanToMean(serAll):
    # replace NaNs in a Series with that Series' mean
    return serAll.replace(np.nan, serAll.mean())

def setPdfNanToMean(pdfAll, listCols):
    pdfAll.ix[:, listCols] = pdfAll.ix[:, listCols].apply(setSerNanToMean)

setPdfNanToMean(pdfAll, [1, 2, 3, 4])

This code does not work:

def setSerNanToMean(serAll):
    return serAll.replace(np.nan, serAll.mean())

def setPdfNanToMean(pdfAll, listCols):
    pdfAll.ix[:, listCols].apply(setSerNanToMean)  # This line has changed!

setPdfNanToMean(pdfAll, [1, 2, 3, 4])

Why does the second block of code not work? Doesn't DataFrame.apply() default to inplace? There is no inplace parameter to the apply function. If it doesn't work in place, doesn't this make pandas a terrible memory handler? Do all pandas data frame operations copy everything in situations like this? Wouldn't it be better to just do it inplace? Even if it doesn't default to inplace, shouldn't it provide an inplace parameter the way replace() does?

I'm not looking for specific answers as much as general understanding. Again, one of these code blocks works, so I can keep moving forward, but what I really want to do is understand how pandas handles the manipulation of memory objects. I have McKinney's book, so page references very welcome so you don't have to type too much.

Upvotes: 1

Views: 314

Answers (1)

Andy Hayden

Reputation: 375675

No, apply does not work inplace*.
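A quick way to see this for yourself (a minimal sketch with a made-up frame, using .loc rather than the question's .ix):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 4.0]})

# apply returns a *new* object; without assignment the original is untouched
result = df.apply(lambda s: s.fillna(s.mean()))

print(df["A"].isna().sum())      # 1 -- the original still has its NaN
print(result["A"].isna().sum())  # 0 -- the returned copy is filled

# assigning the result back is what makes the first block "work"
df.loc[:, ["A", "B"]] = df.loc[:, ["A", "B"]].apply(lambda s: s.fillna(s.mean()))
print(df.isna().sum().sum())     # 0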

Here's another subtlety for you: the inplace flag doesn't actually guarantee that the operation happens in place (!). To give an example:

In [11]: s = pd.Series([1, 2, np.nan, 4])

In [12]: s._data._values
Out[12]: array([  1.,   2.,  nan,   4.])

In [13]: vals = s._data._values

In [14]: s.fillna(s.mean(), inplace=True)

In [15]: vals is s._data._values  # values are the same
Out[15]: True

In [16]: vals
Out[16]: array([ 1.        ,  2.        ,  2.33333333,  4.        ])

In [21]: s = pd.Series([1, 2, np.nan, 4])  # start again

In [22]: vals = s._data._values

In [23]: s.fillna('mean', inplace=True)

In [24]: vals is s._data._values  # values are *not* the same
Out[24]: False

In [25]: s._data._values
Out[25]: array([1.0, 2.0, 'mean', 4.0], dtype=object)

Note: often, if the dtype stays the same, the underlying values array is reused, but pandas does not guarantee this.

In general apply is slow (since you are essentially iterating through each row in python), and the "game" is to rewrite that function in terms of native pandas/numpy functions and indexing. If you want to delve into more detail about the internals, check out the BlockManager in core/internals.py; this is the object which holds the underlying numpy arrays. But to be honest, I think your most useful tools are %timeit and reading the source code of specific functions (?? in IPython).
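For instance, this particular fill can be written without apply at all, since fillna accepts a Series of per-column fill values (a sketch with made-up column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [np.nan, 2.0, 4.0],
                   "C": [np.nan, np.nan, np.nan]})

cols = ["A", "B"]

# df[cols].mean() is a Series indexed by column label, so fillna
# fills each column with its own mean -- no python-level iteration
df[cols] = df[cols].fillna(df[cols].mean())

print(df)  # A and B are filled; C is left alone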

In this specific example I would consider using fillna in an explicit for loop of the columns you want:

In [31]: df = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6]], columns=['A', 'B', 'C'])

In [32]: for col in ["A", "B"]:
   ....:     df[col].fillna(df[col].mean(), inplace=True)
   ....:

In [33]: df
Out[33]:
   A  B   C
0  1  2 NaN
1  4  2   6

(Perhaps it would make sense for fillna to have a columns argument for this usecase?)
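In the same spirit, fillna does accept a dict keyed by column, which restricts the fill to just those columns without an explicit loop (a sketch reusing the frame above):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6]], columns=["A", "B", "C"])

# a dict of {column: fill value} limits the fill to the listed columns
df = df.fillna({"A": df["A"].mean(), "B": df["B"].mean()})

print(df)  # B's NaN becomes 2.0; C's NaN is untouched

Note this returns a new frame rather than filling in place, so the result is assigned back.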

All of this isn't to say pandas is memory inefficient... but efficient (and memory efficient) code sometimes has to be thought about.

*apply is not usually going to make sense inplace (and IMO this behaviour would rarely be desired).

Upvotes: 2
