user248237

Reputation:

Implementing R's scale function in pandas in Python?

What is an efficient equivalent of R's scale function in pandas? For example, how would

newdf <- scale(df)

be written in pandas? Is there an elegant way using transform?

Upvotes: 9

Views: 7327

Answers (2)

herrfz

Reputation: 4904

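Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function. Note that it divides by the population standard deviation (ddof=0), whereas R's scale uses the sample standard deviation (n - 1 denominator), so the two results differ slightly.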

The only "problem" is that the returned object is no longer a DataFrame but a NumPy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. SVM or logistic regression). If you want to keep the DataFrame, a small workaround is needed:

from sklearn.preprocessing import scale
from pandas import DataFrame

newdf = DataFrame(scale(df), index=df.index, columns=df.columns)

See also here.
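
If an exact match with R matters, a pandas-only version may be simpler. The following is a minimal sketch, assuming every column is numeric; the example data is made up. pandas' std defaults to ddof=1, the same n - 1 convention R uses, and the result stays a DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up example data; any all-numeric DataFrame should work the same way.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 30.0, 50.0, 90.0]})

# Center each column on its mean, then divide by the sample standard
# deviation (pandas' default ddof=1, i.e. an n - 1 denominator).
newdf = (df - df.mean()) / df.std()

# For comparison: the population convention (ddof=0, an n denominator).
newdf_pop = (df - df.mean()) / df.std(ddof=0)
```

The two results differ by a constant factor of sqrt(n / (n - 1)), which is negligible for large n but visible on small data.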

Upvotes: 12

Phillip Cloud

Reputation: 25692

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):

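import numpy as np

def scale(y, c=True, sc=True):
    x = y.copy()  # work on a copy so the caller's data is left untouched

    if c:
        x -= x.mean()  # center each column on its mean
    if sc and c:
        x /= x.std()  # sample standard deviation (ddof=1 by default)
    elif sc:
        # not centered: divide by the root mean square, sqrt(sum(x^2) / (n - 1))
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x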

For the more general version you'd probably need to do some type/length checking.
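
One possible shape for that more general version is sketched below. R's scale also accepts a numeric vector with one value per column for center and for scale; the name scale_general and the helper as_series are hypothetical, and the sketch assumes an all-numeric DataFrame.

```python
import numpy as np
import pandas as pd


def scale_general(df, center=True, scale=True):
    """Hypothetical sketch: center/scale may be a bool or one value per column."""
    x = df.astype(float)  # astype returns a new object, so df is not modified

    def as_series(values, name):
        # Length check: exactly one value per column is required.
        arr = np.asarray(values, dtype=float)
        if arr.ndim != 1 or arr.shape[0] != x.shape[1]:
            raise ValueError(f"{name} must have one value per column ({x.shape[1]})")
        return pd.Series(arr, index=x.columns)

    if isinstance(center, (bool, np.bool_)):
        if center:
            x = x - x.mean()
    else:
        x = x - as_series(center, "center")

    if isinstance(scale, (bool, np.bool_)):
        if scale:
            # Root mean square with an n - 1 denominator; this equals the sample
            # standard deviation when the columns were centered on their means.
            x = x / np.sqrt(x.pow(2).sum().div(x.count() - 1))
    else:
        x = x / as_series(scale, "scale")
    return x


# Made-up data to show both calling styles.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})
standardized = scale_general(demo)
custom = scale_general(demo, center=[1, 2], scale=[2, 2])
```

Subtracting or dividing by a Series indexed by the column labels lets pandas align the values per column, so no explicit loop is needed.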

EDIT: Added an explanation of the denominator in the elif sc: clause.

From the R docs:

 ... If ‘scale’ is
 ‘TRUE’ then scaling is done by dividing the (centered) columns of
 ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
 root mean square otherwise.  If ‘scale’ is ‘FALSE’, no scaling is
 done.

 The root-mean-square for a (possibly centered) column is defined
 as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
 values and n is the number of non-missing values.  In the case
 ‘center = TRUE’, this is the same as the standard deviation, but
 in general it is not.

The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from that definition: it squares x (the pow method), sums each column (sum skips NaN values), divides by the number of non-NaN values in each column minus one (the count method), and takes the square root.
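
As a quick check of the statement in the R docs, the sketch below (made-up data, with one NaN to exercise the missing-value handling; the helper name rms is hypothetical) compares the RMS against the standard deviation with and without centering.

```python
import numpy as np
import pandas as pd

# Made-up data with a missing value in column "a".
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})


def rms(x):
    # sqrt(sum(x^2) / (n - 1)) per column, where n counts non-NaN values
    return np.sqrt(x.pow(2).sum().div(x.count() - 1))


centered = df - df.mean()

rms_centered = rms(centered)  # matches df.std() column by column
rms_raw = rms(df)             # generally differs from df.std()

print(pd.DataFrame({"std": df.std(), "rms_centered": rms_centered, "rms_raw": rms_raw}))
```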

As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation in the special case where you want the standard deviation rather than the more general RMS.

You could instead compute the RMS after centering. It might be worth a benchmark: now that I'm writing this, I'm not sure which is faster, and I haven't measured it.
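
A rough way to run that comparison is sketched below with timeit. The data size, the repeat count, and the function names via_std and via_rms are arbitrary choices for illustration; timings will vary with the machine, the pandas version, and whether bottleneck is installed, so no result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 10)))  # arbitrary synthetic data
centered = df - df.mean()


def via_std():
    # scale already-centered data with the built-in standard deviation
    return centered / centered.std()


def via_rms():
    # scale already-centered data with the explicit RMS expression
    return centered / np.sqrt(centered.pow(2).sum().div(centered.count() - 1))


t_std = timeit.timeit(via_std, number=20)
t_rms = timeit.timeit(via_rms, number=20)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s")
```

Both functions return the same values on centered data, so only the timing differs.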

Upvotes: 8
