Reputation: 5081
In the documentation for the PCA class in scikit-learn, there is a copy argument that is True by default. The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
I'm not sure what this is saying, though, because how would the function overwrite the input X? When you call .fit(X), the function should just be calculating the PCA vectors and updating the internal state of the PCA object, right? So even if you set copy to False, .fit(X) should still return the object self, as the documentation says, so shouldn't fit(X).transform(X) still work?
So what is actually being copied when this argument is True, and what goes wrong when it is False?
Additionally, when would I want to set it to False?
Edit: I ran fit and transform together and separately and got slightly different results, even though the copy parameter was the same (True) for both.
from sklearn.decomposition import PCA
import numpy as np

X = np.arange(20).reshape((5, 4))

# Case 1: fit and transform called separately
print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)

# Case 2: fit_transform called in one step
print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True)
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)
########## Results
Separate
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 -2.66453526e-15]
[ 8.00000000e+00 -1.33226763e-15]
[ 0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 1.33226763e-15]
[ -1.60000000e+01 2.66453526e-15]]
Combined
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 1.44100598e-15]
[ 8.00000000e+00 -4.80335326e-16]
[ -0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 4.80335326e-16]
[ -1.60000000e+01 9.60670651e-16]]
Upvotes: 3
Views: 1584
Reputation:
In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:
X = np.arange(20).reshape((5,4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)
This prints the original X:
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]
[ 16. 17. 18. 19.]]
and then shows that X was mutated by the fit method:
[[-8. -8. -8. -8.]
[-4. -4. -4. -4.]
[ 0. 0. 0. 0.]
[ 4. 4. 4. 4.]
[ 8. 8. 8. 8.]]
The culprit is this line in the scikit-learn source: X -= self.mean_, where the augmented assignment mutates the array in place.
If you set copy=True, which is the default value, then X is not mutated.
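To make the mechanism concrete, here is a minimal sketch (plain NumPy, not scikit-learn's actual code) of how in-place subtraction mutates the caller's array while ordinary subtraction does not:
import numpy as np

def center_in_place(X):
    X -= X.mean(axis=0)   # augmented assignment writes into X's own buffer
    return X

def center_with_copy(X):
    return X - X.mean(axis=0)   # binary subtraction allocates a new array

A = np.arange(6, dtype=np.float64).reshape(3, 2)
center_in_place(A)
print(A)   # A has been centered: the caller's array changed

B = np.arange(6, dtype=np.float64).reshape(3, 2)
center_with_copy(B)
print(B)   # B is unchanged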
Why didn't copy make a difference in your example? The only thing the PCA.fit method does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has dtype float32 or float64. If the dtype isn't one of those, a type conversion happens, and that conversion creates a copy anyway (in your example, the conversion is from int to float). This is why, in my example above, I made X a float array.
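You can see this behavior by calling check_array directly. A small sketch (the exact dtype list passed here is my assumption, consistent with the float32/float64 requirement described above):
import numpy as np
from sklearn.utils import check_array

X_int = np.arange(20).reshape(5, 4)   # integer dtype
X_out = check_array(X_int, dtype=[np.float64, np.float32], copy=False)
print(X_out.dtype)                      # float64: the int data was converted
print(np.shares_memory(X_out, X_int))   # False: conversion made a copy despite copy=False

X_float = X_int.astype(np.float64)
X_out2 = check_array(X_float, dtype=[np.float64, np.float32], copy=False)
print(np.shares_memory(X_out2, X_float))  # True: already float, so no copy is forced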
fit().transform() and fit_transform()
You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter. The difference is within the error of double-precision arithmetic, and it comes from the following:
fit performs the SVD as X = U @ S @ V.T (where @ means matrix multiplication) and stores V in the components_ property. transform multiplies the data by V.
fit_transform performs the same SVD as fit does, and then returns U @ S.
Mathematically, U @ S is the same as X @ V because V is an orthogonal matrix, but the errors of floating-point arithmetic result in tiny differences.
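A quick sketch with plain NumPy (not scikit-learn internals) shows the two products agree only up to floating-point error:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Xc = X - X.mean(axis=0)     # PCA centers the data before the SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

T1 = Xc @ Vt.T                # X @ V: what fit(X).transform(X) computes
T2 = U * S                    # U @ diag(S): what fit_transform(X) returns
print(np.abs(T1 - T2).max())  # tiny but nonzero, on the order of 1e-15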
It makes sense that fit_transform does U @ S instead of X @ V; it's a simpler and more accurate multiplication to perform because S is diagonal. The reason transform does not do the same is that only V is stored, and in any case transform has no way of knowing that the argument it receives is the same data the model was fitted on.
Upvotes: 4