skywalkerytx

Reputation: 244

StandardScaler returns NaN

env:

spark-1.6.0 with scala-2.10.4

usage:

import org.apache.spark.ml.feature.StandardScaler

// each row of df: (id1: String, id2: String, label: Double, feature: Vector)
val df = sqlContext.read.parquet("data/Labeled.parquet")

val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)

val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler

NaN values appear in scaled, as well as in SC.mean and SC.std.

I don't understand how StandardScaler could produce NaN, even in the mean, or how to handle this situation. Any advice is appreciated.

The data is 1.6 GiB as Parquet; if anyone needs it, just let me know.
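In case it helps, here is a minimal sketch (assuming the column names from the snippet above and spark-1.6's mllib Vector type) for checking whether the input column itself already contains NaN or extreme entries before scaling:

import org.apache.spark.mllib.linalg.Vector

// scan the "feature" column for NaN, Infinity or near-overflow entries
val suspicious = df.select("id1", "id2", "feature").rdd.filter { row =>
  val v = row.getAs[Vector]("feature")
  v.toArray.exists(x => x.isNaN || x.isInfinite || math.abs(x) > 1e300)
}
suspicious.take(5).foreach(println)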

UPDATE:

I went through the code of StandardScaler, and this looks like a Double precision problem when MultivariateOnlineSummarizer aggregates the statistics.
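A minimal plain-Scala illustration of that suspected failure mode (not the summarizer code itself): once an intermediate sum overflows to Infinity, subtracting two infinities yields NaN, which would then propagate into the mean, std and the scaled output.

val sum = Double.MaxValue + Double.MaxValue   // overflows to Infinity
val diff = sum - sum                          // Infinity - Infinity = NaN
println(sum)          // Infinity
println(diff)         // NaN
println(diff.isNaN)   // true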

Upvotes: 1

Views: 1856

Answers (2)

Tiago_A

Reputation: 1

One thing that I tried when I faced the same problem is resetting the index of both of the dataframes I was manipulating, after the standardization procedure:

df = df.reset_index()
df_norm = df_norm.reset_index()

Upvotes: 0

skywalkerytx

Reputation: 244

There is a value equal to Double.MaxValue, and when StandardScaler sums the column, the result overflows.

Simply casting those columns to scala.math.BigDecimal works.

ref here:

http://www.scala-lang.org/api/current/index.html#scala.math.BigDecimal
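A small sketch of why that sidesteps the issue (plain Scala, not the Spark job itself): scala.math.BigDecimal is arbitrary-precision, so the same sum that overflows Double stays exact.

val asDouble = Double.MaxValue + Double.MaxValue                  // Infinity
val asBig = BigDecimal(Double.MaxValue) + BigDecimal(Double.MaxValue)
println(asDouble)                             // Infinity
println(asBig > BigDecimal(Double.MaxValue))  // true, still an exact finite value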

Upvotes: 2
