Reputation: 244
env:
spark-1.6.0 with scala-2.10.4
usage:
import org.apache.spark.ml.feature.StandardScaler

// each row of df: DataFrame is (String, String, Double, Vector) as (id1, id2, label, feature)
val df = sqlContext.read.parquet("data/Labeled.parquet")
val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)
val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")
The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler
NaN values appear in scaled, in SC.mean, and in SC.std.
I don't understand how StandardScaler could produce NaN even in the mean, or how to handle this situation. Any advice is appreciated.
The data is 1.6 GiB as Parquet; if anyone needs it, just let me know.
UPDATE:
After going through the code of StandardScaler, this is likely a problem with Double precision when MultivariateOnlineSummarizer aggregates the columns.
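For intuition, a minimal plain-Scala sketch (no Spark involved) of how Double aggregation can degrade into NaN when values sit near the representable limit:

val sum  = Double.MaxValue + Double.MaxValue  // overflows to Infinity
val diff = sum - sum                          // Infinity - Infinity is NaN
println(sum)   // Infinity
println(diff)  // NaN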
Upvotes: 1
Views: 1856
Reputation: 1
One thing I tried when I faced the same problem was resetting the index of both dataframes I was manipulating, after the standardization procedure:
df = df.reset_index()
df_norm = df_norm.reset_index()
Upvotes: 0
Reputation: 244
There is a value equal to Double.MaxValue, and when StandardScaler sums the columns, the result overflows.
Simply casting those columns to scala.math.BigDecimal works.
Ref:
http://www.scala-lang.org/api/current/index.html#scala.math.BigDecimal
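If you want to locate the offending rows before (or instead of) converting to BigDecimal, a sketch along these lines can help. It assumes the "feature" vector column from the question; isSuspect is a hypothetical helper, not part of the Spark API:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// flags feature vectors containing Double.MaxValue (or any non-finite value)
// that would overflow the summarizer's running sums
val isSuspect = udf { v: Vector =>
  v.toArray.exists(x => x == Double.MaxValue || x.isNaN || x.isInfinite)
}

// inspect the offending rows, then fit the scaler on the clean subset only
df.filter(isSuspect(df("feature"))).show()
val clean = df.filter(!isSuspect(df("feature")))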
Upvotes: 2