Reputation: 407
Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:
{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0},
I would expect new Normalizer().transform(vectors) to create a JavaRDD where each feature is normalized as (v - mean) / stdev across all values of feature-0, feature-1, etc.
The resulting set is:
[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552]
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999]
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552]
[0.0,0.0,0.0,0.0,0.0,0.0,1.0]
Note that all the original values 70000.0 result in different 'normalized' values. Also, how was, for example, 1.357142668768307E-5 calculated when the values for that feature are 0.95, 1, -1, -0.95 and 0? What's more, if I remove a feature, the results change. I could not find any documentation on this.
In fact, my question is, how to normalize all vectors in RDD correctly?
Upvotes: 8
Views: 10164
Reputation: 330083
Your expectations are simply incorrect. As clearly stated in the official documentation, Normalizer "scales individual samples to have unit L^p norm", where the default value of p is 2. Ignoring numerical precision issues:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.Normalizer

val rdd = sc.parallelize(Seq(
  Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),
  Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),
  Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
  Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))

val normalizer = new Normalizer()  // p = 2 by default
val transformed = normalizer.transform(rdd)

// Sum of entries per row; each sum is close to 1 only because the
// 70000.0 component dominates every vector
transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
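This also answers your question about 1.357142668768307E-5: it is 0.95 (the first component of your first input vector) divided by the L2 norm of that whole sample, which is dominated by the 70000.0 entry. A quick sketch in plain Scala, using the values from your question:

// L2 norm of the first sample
val v = Array(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)
val norm = math.sqrt(v.map(x => x * x).sum)   // ≈ 70000.0097
val scaled = v.map(_ / norm)
scaled(0)                                     // ≈ 1.357142668768307E-5

// And to check directly that every transformed row has unit L2 norm:
transformed.map(Vectors.norm(_, 2)).collect   // all ≈ 1.0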
MLlib doesn't provide the functionality you need, but you can use StandardScaler from ML.
import org.apache.spark.ml.feature.StandardScaler

// toDF and the $ column syntax below need the SQL implicits in scope
// (pre-imported in spark-shell; otherwise import sqlContext.implicits._
// on Spark 1.x or spark.implicits._ on 2.x)
val df = rdd.map(Tuple1(_)).toDF("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val transformedDF = scaler.fit(df).transform(df)

transformedDF.select($"scaledFeatures").show(5, false)
// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0] |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0] |
// +--------------------------------------------------------------------------------------------------------------------------+
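If you still need an RDD of vectors afterwards, as in your original question, you can pull the scaled column back out. A sketch assuming Spark 1.x, where the ML StandardScaler still produces mllib vectors:

import org.apache.spark.mllib.linalg.Vector

// Extract the scaled column from each Row back into an RDD[Vector]
val scaledRdd = transformedDF
  .select("scaledFeatures")
  .rdd
  .map(_.getAs[Vector](0))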
Upvotes: 9