Reputation: 6181
I am using scikit-learn preprocessing scaling for sparse matrices.
My goal is to "scale" each feature column by taking the logarithm of its values, with the column's maximum value as the base. My wording may be inexact, so let me explain.
Say a feature column has the values: 0, 8, 2. I want to transform them into:
math.log(0+1, 8+1)
math.log(8+1, 8+1)
math.log(2+1, 8+1)
(The +1 is to cope with zeros; so yes, we are actually taking log base 9.)
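For concreteness, here is a tiny runnable sketch of the per-column computation I am after (plain Python, just to illustrate):
import math

col = [0, 8, 2]
base = max(col) + 1  # base = column maximum + 1, here 9
print([math.log(v + 1, base) for v in col])
# [0.0, 1.0, 0.5]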
Yes, I can easily apply an arbitrary function-based transformer with FunctionTransformer, but I want the base of the logarithm to change with each column (specifically, to be its maximum value). That is, I want to do something like MaxAbsScaler, only taking logarithms.
I see that MaxAbsScaler first computes a vector (scale) of the maximum absolute values of each column (code) and then multiplies the original matrix by 1 / scale (code).
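Roughly, the equivalent scipy.sparse operation looks like the sketch below (my own approximation, not MaxAbsScaler's actual code; it assumes no column is entirely zero):
import numpy as np
from scipy.sparse import csc_matrix, diags

X = csc_matrix([[1., 0, 8], [2., 0, 0], [0, 1., 2]])
scale = abs(X).max(axis=0).toarray().ravel()  # per-column maximum absolute value
Y = X @ diags(1.0 / scale)  # right-multiplying by a diagonal matrix scales each column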
However, I don't know what to do if I want to take logarithms based on the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other scipy sparse operations that would do this efficiently?
I hope my intent is clear (and possible).
Upvotes: 2
Views: 633
Reputation:
The logarithm of x in base b is the same as log(x)/log(b), where the logs are natural. So the process you describe amounts to first applying the log(x+1) transformation to everything and then scaling each column by its maximum absolute value: since log(x+1) is increasing, the maximum of a transformed column is log(max+1), which is exactly the divisor you want. Conveniently, log(x+1) is a built-in NumPy function, log1p. Example:
from sklearn.preprocessing import FunctionTransformer, maxabs_scale
from scipy.sparse import csc_matrix
import numpy as np

# log1p maps 0 to 0, so the transformation preserves sparsity
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
X = csc_matrix([[1., 0, 8], [2., 0, 0], [0, 1., 2]])
# scale each column of log1p(X) by its maximum absolute value
Y = maxabs_scale(logtran.transform(X))
Output (sparse matrix Y):
(0, 0) 0.630929753571
(1, 0) 1.0
(2, 1) 1.0
(0, 2) 1.0
(2, 2) 0.5
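As a quick sanity check (reusing Y from the snippet above), the entries match the per-element formula from the question:
import math

# Column 2 of X has maximum 8, so X[0, 2] = 8 maps to log(9, 9) = 1.0
# and X[2, 2] = 2 maps to log(3, 9) = 0.5, matching Y above.
print(math.log(8 + 1, 8 + 1), Y[0, 2])  # 1.0 1.0
print(math.log(2 + 1, 8 + 1), Y[2, 2])  # 0.5 0.5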
Upvotes: 3