Reputation: 523
I am trying out PCA (principal component analysis) in Spark ML.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()
Output:
+---------+--------------------+
| features| pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+
When I tried PCA on the same data in scikit-learn, as below, it gave a different result:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)
Output:
[ 1. 1.] [-2.44120041]
[ 1. 2.] [-1.85996222]
[ 4. 4.] [ 1.74371458]
[ 5. 4.] [ 2.55744805]
As you can see, there is a difference in the output.
To verify the result, I calculated the PCA for the same data mathematically and got the same result as scikit-learn. For the first data point (1.0, 1.0), with the mean vector MX = (2.75, 2.75) and the first principal component A ≈ (0.814, 0.581), the transformation is:
Y = (0.814*(1.0-2.75)) + (0.581*(1.0-2.75)) = -2.441
As you can see, it matches the scikit-learn result.
It seems Spark ML doesn't subtract the mean vector MX from the data vector X, i.e. it uses Y = A*X instead of Y = A*(X-MX).
For the point (1.0, 1.0):
Y = (0.814*1.0) + (0.581*1.0) = 1.395
which (up to sign) is the same result we got with Spark ML.
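To double-check, here is a minimal NumPy sketch that projects the data onto the component both with and without centering (the values 0.814 and 0.581 are the approximate first principal component from above):
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
A = np.array([0.814, 0.581])  # approximate first principal component
MX = X.mean(axis=0)           # mean vector, [2.75, 2.75]

print(X @ A)         # ~[1.395, 1.976, 5.580, 6.394] -> matches Spark ML up to sign
print((X - MX) @ A)  # ~[-2.441, -1.860, 1.744, 2.557] -> matches scikit-learn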
Is Spark ML giving a wrong result, or am I missing something?
Upvotes: 6
Views: 2007
Reputation: 28332
In Spark, the PCA transformation does not center the input data for you; you need to take care of that yourself before applying the method. To subtract the mean from the data, StandardScaler can be used in the following way:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)
The PCA method can then be applied to scaled_df in the same way as before, and the results will match what scikit-learn gave (possibly up to an overall sign, since the sign of a principal component is arbitrary).
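For example (a minimal sketch, reusing scaled_df from above and pointing PCA at the new scaledFeatures column):
from pyspark.ml.feature import PCA

pca = PCA(k=1, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(scaled_df)
model.transform(scaled_df).select("pcaFeatures").show(truncate=False)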
I would recommend using a Spark ML Pipeline to simplify the process. Using the standardization and PCA together could look like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])
model = pipeline.fit(df)
transformed_feature = model.transform(df)
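To inspect the result (a quick check; again, the projected values may differ from scikit-learn's by an overall sign):
transformed_feature.select("features", "pcaFeatures").show(truncate=False)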
Upvotes: 5