Reputation: 222
I was doing some scaling on the dataset below using Spark MLlib:
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 1|[3.0,10.1,3.0]|
+---+--------------+
You can find this dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet
After performing standard scaling, I get the result below:
+---+--------------+------------------------------------------------------------+
|id |features |stdScal_06f7a85f98ef__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
If I perform min/max scaling (setting val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")), I get the following:
+---+--------------+-------------------------------+
| id| features|minMaxScal_21493d63e2bf__output|
+---+--------------+-------------------------------+
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 1|[3.0,10.1,3.0]| [10.0,10.0,10.0]|
+---+--------------+-------------------------------+
Please find the code below:
// loading dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
// using StandardScaler
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
// using min/max scaler
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
I know the formulas for standardization and min/max scaling, but I am unable to understand how it arrives at the values in the third column. Please help me understand the math behind it.
Upvotes: 1
Views: 264
Reputation: 28352
MinMaxScaler in Spark works on each feature individually. From the documentation we have:
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$
[...]
So each column in the features array will be scaled separately.
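To make this concrete, here is a minimal sketch in plain Scala (no Spark required) that applies the formula above by hand; the rows and the min/max of 5 and 10 are copied from the question:

// the five feature rows from the question's dataset
val rows = Seq(
  Array(1.0, 0.1, -1.0),
  Array(2.0, 1.1, 1.0),
  Array(1.0, 0.1, -1.0),
  Array(2.0, 1.1, 1.0),
  Array(3.0, 10.1, 3.0)
)
val (targetMin, targetMax) = (5.0, 10.0) // the values passed to setMin/setMax
// column summary statistics: E_min and E_max for each feature
val numCols = rows.head.length
val eMin = (0 until numCols).map(j => rows.map(_(j)).min)
val eMax = (0 until numCols).map(j => rows.map(_(j)).max)
// Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min
val rescaled = rows.map { row =>
  row.indices.map { j =>
    (row(j) - eMin(j)) / (eMax(j) - eMin(j)) * (targetMax - targetMin) + targetMin
  }
}
rescaled.foreach(r => println(r.mkString("[", ",", "]")))
// prints [5.0,5.0,5.0], [7.5,5.5,7.5], ..., [10.0,10.0,10.0]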
In this case, the MinMaxScaler is set to a minimum value of 5 and a maximum value of 10.
The calculation for each column will thus be:
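First column ($E_{min} = 1.0$, $E_{max} = 3.0$):
$$ Rescaled(1.0) = \frac{1.0 - 1.0}{3.0 - 1.0} \cdot (10 - 5) + 5 = 5.0 $$
$$ Rescaled(2.0) = \frac{2.0 - 1.0}{3.0 - 1.0} \cdot (10 - 5) + 5 = 7.5 $$
$$ Rescaled(3.0) = \frac{3.0 - 1.0}{3.0 - 1.0} \cdot (10 - 5) + 5 = 10.0 $$
Second column ($E_{min} = 0.1$, $E_{max} = 10.1$):
$$ Rescaled(0.1) = \frac{0.1 - 0.1}{10.1 - 0.1} \cdot (10 - 5) + 5 = 5.0 $$
$$ Rescaled(1.1) = \frac{1.1 - 0.1}{10.1 - 0.1} \cdot (10 - 5) + 5 = 5.5 $$
$$ Rescaled(10.1) = \frac{10.1 - 0.1}{10.1 - 0.1} \cdot (10 - 5) + 5 = 10.0 $$
Third column ($E_{min} = -1.0$, $E_{max} = 3.0$):
$$ Rescaled(-1.0) = \frac{-1.0 - (-1.0)}{3.0 - (-1.0)} \cdot (10 - 5) + 5 = 5.0 $$
$$ Rescaled(1.0) = \frac{1.0 - (-1.0)}{3.0 - (-1.0)} \cdot (10 - 5) + 5 = 7.5 $$
$$ Rescaled(3.0) = \frac{3.0 - (-1.0)}{3.0 - (-1.0)} \cdot (10 - 5) + 5 = 10.0 $$
These values match the third column of your min/max output exactly.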
Upvotes: 1