Reputation: 523
I am trying out PCA (principal component analysis) in Spark ML.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([4.0, 4.0]),),
        (Vectors.dense([5.0, 4.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=1, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
transformed_feature = model.transform(df)
transformed_feature.show()
Output:
+---------+--------------------+
| features| pcaFeatures|
+---------+--------------------+
|[1.0,1.0]|[-1.3949716649258...|
|[1.0,2.0]|[-1.976209858644928]|
|[4.0,4.0]|[-5.579886659703326]|
|[5.0,4.0]|[-6.393620130910061]|
+---------+--------------------+
When I tried PCA on the same data in scikit-learn, as below, it gave a different result:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
pca = PCA(n_components=1)
pca.fit(X)
X_transformed = pca.transform(X)
for x, y in zip(X, X_transformed):
    print(x, y)
Output:
[ 1. 1.] [-2.44120041]
[ 1. 2.] [-1.85996222]
[ 4. 4.] [ 1.74371458]
[ 5. 4.] [ 2.55744805]
As you can see, there is a difference in the output.
To verify the result, I calculated the PCA for the same data mathematically and got the same result as scikit-learn. For the first data point (1.0, 1.0), with the mean vector MX = (2.75, 2.75) and the first principal component A ≈ (0.814, 0.581), the transformation is:
Y = (0.814*(1.0-2.75)) + (0.581*(1.0-2.75)) = -2.441
As you can see, it matches the scikit-learn result.
It seems Spark ML doesn't subtract the mean vector MX from the data vector X, i.e. it uses Y = A*X instead of Y = A*(X-MX).
For the point (1.0, 1.0):
Y = (0.814*1.0) + (0.581*1.0) = 1.395
which (up to sign) is the same result we got with Spark ML.
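To double-check, here is a minimal NumPy sketch that projects the data onto the component both with and without centering (the values 0.814 and 0.581 are the approximate first principal component from above):
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
A = np.array([0.814, 0.581])  # approximate first principal component
MX = X.mean(axis=0)           # mean vector, [2.75, 2.75]

print(X @ A)         # ~[1.395, 1.976, 5.580, 6.394] -> matches Spark ML up to sign
print((X - MX) @ A)  # ~[-2.441, -1.860, 1.744, 2.557] -> matches scikit-learn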
Is Spark ML giving a wrong result, or am I missing something?
Upvotes: 6
Views: 2007
Reputation: 28332
In Spark, the PCA transformation does not center the input data for you; you need to take care of that yourself before applying the method. To subtract the mean from the data, StandardScaler can be used in the following way:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
scaled_df = scaler.fit(df).transform(df)
The PCA method can then be applied to scaled_df in the same way as before, and the results will match what scikit-learn gave (possibly up to an overall sign, since the sign of a principal component is arbitrary).
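For example (a minimal sketch, reusing scaled_df from above and pointing PCA at the new scaledFeatures column):
from pyspark.ml.feature import PCA

pca = PCA(k=1, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(scaled_df)
model.transform(scaled_df).select("pcaFeatures").show(truncate=False)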
I would recommend using a Spark ML Pipeline to simplify the process. Using the standardization and PCA together could look like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=False, withMean=True)
pca = PCA(k=1, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[scaler, pca])
model = pipeline.fit(df)
transformed_feature = model.transform(df)
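To inspect the result (a quick check; again, the projected values may differ from scikit-learn's by an overall sign):
transformed_feature.select("features", "pcaFeatures").show(truncate=False)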
Upvotes: 5