Kumar-Sandeep

Reputation: 222

Scaling a dataset with MLlib

I was doing some scaling on the dataset below using Spark MLlib:

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  0|[1.0,0.1,-1.0]|
|  1| [2.0,1.1,1.0]|
|  1|[3.0,10.1,3.0]|
+---+--------------+

You can find the dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet

After performing standard scaling, I get the following result:

+---+--------------+------------------------------------------------------------+
|id |features      |stdScal_06f7a85f98ef__output                                |
+---+--------------+------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|1  |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902]  |
+---+--------------+------------------------------------------------------------+

If I perform min/max scaling (with val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")), I get the following:

+---+--------------+-------------------------------+
| id|      features|minMaxScal_21493d63e2bf__output|
+---+--------------+-------------------------------+
|  0|[1.0,0.1,-1.0]|                  [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                  [7.5,5.5,7.5]|
|  0|[1.0,0.1,-1.0]|                  [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                  [7.5,5.5,7.5]|
|  1|[3.0,10.1,3.0]|               [10.0,10.0,10.0]|
+---+--------------+-------------------------------+

Please find the code below:

// load the dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")

// standard scaling: by default divides each feature by its standard deviation
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)

// min/max scaling: rescale each feature to the range [5, 10]
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedMinMax = minMax.fit(scaleDF)
fittedMinMax.transform(scaleDF).show()

I know the formulas for standardization and min/max scaling, but I can't work out how the values in the third column are computed. Please explain the math behind them.

Upvotes: 1

Views: 264

Answers (1)

Shaido

Reputation: 28352

MinMaxScaler in Spark works on each feature individually. From the documentation we have:

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.

$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$

[...]

So each column in the features array will be scaled separately. In this case, the MinMaxScaler is set to have a minimum value of 5 and a maximum value of 10.
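
For example, plugging the first feature column into this formula, where the column minimum is 1.0, the column maximum is 3.0, and the target range is [5, 10], the value 2.0 maps to:

$$ Rescaled(2.0) = \frac{2.0 - 1.0}{3.0 - 1.0} * (10 - 5) + 5 = 7.5 $$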

The calculation for each column will thus be:

  1. In the first column, the min value is 1.0 and the maximum is 3.0. We have 1.0 -> 5.0 and 3.0 -> 10.0; 2.0 lies halfway between them and therefore becomes 7.5 (this is the worked example above).
  2. In the second column, the min value is 0.1 and the maximum is 10.1. We have 0.1 -> 5.0 and 10.1 -> 10.0. The only other value in the column is 1.1, which becomes ((1.1 - 0.1) / (10.1 - 0.1)) * (10.0 - 5.0) + 5.0 = 5.5 (following the normal min-max formula).
  3. In the third column, the min value is -1.0 and the maximum is 3.0, so we know -1.0 -> 5.0 and 3.0 -> 10.0. For 1.0, which lies exactly in the middle of [-1.0, 3.0], the result is ((1.0 - (-1.0)) / (3.0 - (-1.0))) * (10.0 - 5.0) + 5.0 = 7.5. The sketch after this list reproduces all three columns.
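
To verify the arithmetic end to end, here is a small standalone Scala sketch (plain Scala, no Spark, and not from the original post) that recomputes the per-column statistics and applies the min-max formula by hand; it reproduces the values in the output column above:

// Rescaled(e) = (e - Emin) / (Emax - Emin) * (max - min) + min
def rescale(e: Double, eMin: Double, eMax: Double, tMin: Double, tMax: Double): Double =
  (e - eMin) / (eMax - eMin) * (tMax - tMin) + tMin

// the three distinct feature rows (the duplicated rows don't change min/max)
val rows = Seq(
  Array(1.0, 0.1, -1.0),
  Array(2.0, 1.1, 1.0),
  Array(3.0, 10.1, 3.0)
)
val (tMin, tMax) = (5.0, 10.0)                 // from setMin(5).setMax(10)
val cols = rows.head.indices
val mins = cols.map(i => rows.map(_(i)).min)   // per-column minimum
val maxs = cols.map(i => rows.map(_(i)).max)   // per-column maximum

rows.foreach { r =>
  val scaled = cols.map(i => rescale(r(i), mins(i), maxs(i), tMin, tMax))
  println(scaled.mkString("[", ",", "]"))
}
// prints:
// [5.0,5.0,5.0]
// [7.5,5.5,7.5]
// [10.0,10.0,10.0]

Note that only each column's minimum and maximum matter here, which is why the duplicated rows in the dataset have no effect on the result.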

Upvotes: 1
