What is the efficient equivalent of R's scale function in pandas? E.g., how would
newdf <- scale(df)
be written in pandas? Is there an elegant way using transform?
Upvotes: 9
Views: 7327
Reputation: 4904
Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing
module. You can pass a pandas DataFrame to its scale
function.
The only "problem" is that the returned object is no longer a DataFrame but a numpy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. an SVM or logistic regression). If you want to keep the DataFrame, it requires a small workaround:
from sklearn.preprocessing import scale
from pandas import DataFrame
newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
See also here.
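One detail worth knowing: scikit-learn's scale divides by the population standard deviation (denominator n), while R's scale divides by n - 1, so the two can differ slightly on small frames. If you want to match R exactly and stay in pandas, a minimal sketch (reusing the df from the question; DataFrame.std defaults to ddof=1, like R) is:
# pandas-only equivalent that keeps the n - 1 denominator and the DataFrame type
newdf = (df - df.mean()) / df.std()

# or, with transform, as the question suggests
newdf = df.transform(lambda col: (col - col.mean()) / col.std())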
Upvotes: 12
Reputation: 25692
I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):
import numpy as np

def scale(y, c=True, sc=True):
    x = y.copy()
    if c:
        x -= x.mean()           # center each column
    if sc and c:
        x /= x.std()            # scale by the sample standard deviation
    elif sc:
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))  # scale by the root mean square
    return x
For the more general version you'd probably need to do some type/length checking.
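For example, a quick sanity check with a made-up two-column frame (the column names and values are arbitrary):
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})
newdf = scale(df)      # centered and scaled column-wise, like R's scale(df)
print(newdf.mean())    # approximately 0 for each column
print(newdf.std())     # 1.0 for each column (pandas std uses ddof=1, like R)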
EDIT: Added an explanation of the denominator in the elif sc: clause.
From the R docs:
... If ‘scale’ is
‘TRUE’ then scaling is done by dividing the (centered) columns of
‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
root mean square otherwise. If ‘scale’ is ‘FALSE’, no scaling is
done.
The root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values. In the case
‘center = TRUE’, this is the same as the standard deviation, but
in general it is not.
The line np.sqrt(x.pow(2).sum().div(x.count() - 1))
computes the root mean square from that definition: it first squares x (the pow method), then sums each column, then divides by the count of non-NaN values in each column (the count method), and finally takes the square root.
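To see the connection concretely, here is a small made-up check that the centered RMS with the n - 1 denominator matches the sample standard deviation:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 6.0, 9.0])
centered = s - s.mean()
rms = np.sqrt(centered.pow(2).sum() / (centered.count() - 1))
print(rms, s.std())   # both print the same value (~3.367)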
As a side note, the reason I didn't simply compute the RMS after centering is that the std
method calls bottleneck
for a faster computation in that special case, where you want the standard deviation rather than the more general RMS.
You could instead compute the RMS after centering; it might be worth a benchmark, since now that I'm writing this I'm not actually sure which is faster and I haven't benchmarked it.
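A rough sketch of that benchmark (the frame size and repeat count are arbitrary, and the result will depend on whether bottleneck is installed):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 4))
x = df - df.mean()   # centered frame shared by both candidates

print(timeit.timeit(lambda: x.std(), number=100))
print(timeit.timeit(lambda: np.sqrt(x.pow(2).sum().div(x.count() - 1)), number=100))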
Upvotes: 8