Reputation: 659
I need to de-normalize data that was normalized using the MinMaxScaler from Spark ML.
I was able to normalize my data following the steps from my earlier question: Spark: convert an RDD[LabeledPoint] to a Dataframe to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint].
For example, the original df had the first two columns (labels and features) and, after scaling, the result was:
+------+--------------------+--------------------+
|labels| features| featuresScaled|
+------+--------------------+--------------------+
| 1.0|[6.0,7.0,42.0,1.1...|[1.0,0.2142857142...|
| 1.0|[6.0,18.0,108.0,3...|[1.0,1.0,1.0,1.0,...|
| 1.0|[5.0,7.0,35.0,1.4...|[0.0,0.2142857142...|
| 1.0|[5.0,8.0,40.0,1.6...|[0.0,0.2857142857...|
| 1.0|[6.0,4.0,24.0,0.6...|[1.0,0.0,0.0,0.0,...|
+------+--------------------+--------------------+
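For context, the scaling itself was done roughly like this (a sketch, assuming df_all is the original DataFrame with the features vector column; scaler and df_all are the names used below):

import org.apache.spark.ml.feature.MinMaxScaler

// Rescale every feature in the `features` vector column to [0, 1].
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")

val scaled_df = scaler.fit(df_all).transform(df_all)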
The problem is that now I need to do the opposite process: de-normalize.
To do so, I need the min and max values for each feature column inside the features vector, as well as the values to be de-normalized.
To get min and max, I query the model fitted by the MinMaxScaler as follows:
val df_fitted = scaler.fit(df_all)
val df_fitted_original_min = df_fitted.originalMin // Vector
val df_fitted_original_max = df_fitted.originalMax // Vector

df_fitted_original_min: [1.0,1.0,7.0,0.007,0.052,0.062,1.0,1.0,7.0,1.0]
df_fitted_original_max: [804.0,553.0,143993.0,537.0,1.0,1.0,4955.0,28093.0,42821.0,3212.0]
And, on the other hand, I have this DataFrame:
+--------------------+-----+--------------------+--------------------+-----+-----+--------------------+--------------------+--------------------+-----+
| col_0|col_1| col_2| col_3|col_4|col_5| col_6| col_7| col_8|col_9|
+--------------------+-----+--------------------+--------------------+-----+-----+--------------------+--------------------+--------------------+-----+
|0.009069428120139292| 0.0|9.015488712438252E-6|2.150418860440459E-4| 1.0| 1.0|0.001470074844665...|2.205824685144127...|2.780971210319238...| 0.0|
|0.008070826019024355| 0.0|3.379696051366339...|2.389342641479033...| 1.0| 1.0|0.001308210192425627|1.962949264985630...|1.042521123176856...| 0.0|
|0.009774715414895803| 0.0|1.299590589291292...|1.981673063697640...| 1.0| 1.0|0.001584395736407...|2.377361424206848...| 4.00879434193585E-5| 0.0|
|0.009631155146285946| 0.0|1.218569739510422...|2.016021040879828E-4| 1.0| 1.0|0.001561125874539...|2.342445354515269...|3.758872615157643E-5| 0.0|
+--------------------+-----+--------------------+--------------------+-----+-----+--------------------+--------------------+--------------------+-----+
Now, I need to apply the following equation to get the new values, but I do not know how to do it.
X_original = ( X_scaled * (max - min) ) + min
For each position in the DF, I have to apply this equation with the corresponding max and min values from those vectors.
For example: the value in the first row and first column of the DF is 0.009069428120139292. In that column, the corresponding min and max values are 1.0 and 804.0.
So, the denormalized value is:
X_den = ( 0.009069428120139292 * (804.0 - 1.0) ) + 1.0
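Evaluating this gives X_den ≈ 8.2828.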
It is necessary to clarify that the DF that was normalized in the first place was modified later in the program. That is why I need to apply the de-normalization (otherwise, the easiest way would be to keep a copy of the original DF).
Upvotes: 3
Views: 737
Reputation: 659
I got the answer from https://stackoverflow.com/a/50314767/9759150 and, with a slight adaptation to my problem, completed the de-normalization process.
Let's consider normalized_df as the DataFrame with 10 columns (shown in my question):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// X_original = (X_scaled * (max - min)) + min, expressed on a Column.
// min and max stay Double: several original bounds (e.g. 0.007) are fractional.
val updateFunction = (columnValue: Column, minValue: Double, maxValue: Double) =>
  (columnValue * (lit(maxValue) - lit(minValue))) + lit(minValue)

// Build one de-normalizing expression per column, pairing each column
// with its position in the originalMin/originalMax vectors.
val updateColumns = (df: DataFrame, minVector: Vector, maxVector: Vector,
                     updateFunction: (Column, Double, Double) => Column) => {
  val columns = df.columns
  minVector.toArray.zipWithIndex.map { case (minValue, index) =>
    updateFunction(col(columns(index)), minValue, maxVector(index)).as(columns(index))
  }
}

val dfUpdated = normalized_df.select(
  updateColumns(normalized_df, df_fitted_original_min, df_fitted_original_max, updateFunction): _*
)
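As a quick sanity check (assuming the column names above), the first value of col_0 should match the hand-computed example from my question:

dfUpdated.select("col_0").show(1, truncate = false)
// expected: (0.009069428120139292 * (804.0 - 1.0)) + 1.0 ≈ 8.2828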
Upvotes: 1
Reputation: 77837
You "simply" apply the inverse operations in the opposite order. The equation is in the documentation here. The code of interest is:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
You now have the data set of X_scaled values, and you want to recover the original X values. Your immediate problem is that you lose some basic information in the transformation: X_scaled is a set of data on the range [0, 1]; from it alone you have no way of knowing what the original range was.
To make this work, find and keep the original min and max values. Then it's easy to reverse the linear transformation for each element:
X_original = X_scaled * (max - min) + min
Can you take it from there?
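For instance, a minimal sketch of that reversal for a single value (plain Scala; the denormalize name is just for illustration):

// Reverse min-max scaling for one value, given the original bounds.
def denormalize(xScaled: Double, min: Double, max: Double): Double =
  xScaled * (max - min) + min

denormalize(0.009069428120139292, 1.0, 804.0) // ≈ 8.2828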
Upvotes: 2