rasen58

Reputation: 5081

What does copy=False do in sklearn

In the documentation for the PCA function in scikit-learn, there is a copy argument that is True by default.

The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.

I'm not sure what this means, though: how would the function overwrite the input X? When you call .fit(X), the function should just be calculating the PCA vectors and updating the internal state of the PCA object, right? So even if you set copy to False, .fit(X) should still return self as the documentation says, so shouldn't fit(X).transform(X) still work?

So what is it copying when this argument is set to False?

Additionally, when would I want to set it to False?

Edit: I ran the fit and transform function together and separately and got different results even though the copy parameter was the same for both.

from sklearn.decomposition import PCA
import numpy as np

X = np.arange(20).reshape((5,4))

print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)

print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True) 
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)

########## Results
Separate
Original:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:  [[  1.60000000e+01  -2.66453526e-15]
 [  8.00000000e+00  -1.33226763e-15]
 [  0.00000000e+00   0.00000000e+00]
 [ -8.00000000e+00   1.33226763e-15]
 [ -1.60000000e+01   2.66453526e-15]]

Combined
Original:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:  [[  1.60000000e+01   1.44100598e-15]
 [  8.00000000e+00  -4.80335326e-16]
 [ -0.00000000e+00   0.00000000e+00]
 [ -8.00000000e+00   4.80335326e-16]
 [ -1.60000000e+01   9.60670651e-16]]

Upvotes: 3

Views: 1584

Answers (1)

user6655984

Reputation:

In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:

from sklearn.decomposition import PCA
import numpy as np

X = np.arange(20).reshape((5, 4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)

This prints the original X

[[  0.   1.   2.   3.]
 [  4.   5.   6.   7.]
 [  8.   9.  10.  11.]
 [ 12.  13.  14.  15.]
 [ 16.  17.  18.  19.]]

and then shows that X was mutated by the fit method:

[[-8. -8. -8. -8.]
 [-4. -4. -4. -4.]
 [ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]
 [ 8.  8.  8.  8.]]

The culprit is this line: X -= self.mean_, where augmented assignment mutates the array.

If you set copy=True, which is the default, then X is not mutated.
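The effect of that augmented assignment can be seen with plain NumPy, without scikit-learn at all:

```python
import numpy as np

X = np.arange(20, dtype=np.float64).reshape(5, 4)
mean = X.mean(axis=0)

Y = X - mean   # binary subtraction allocates a new array; X is untouched
X -= mean      # augmented assignment writes into X's own buffer

print(X[0])    # [-8. -8. -8. -8.] -- the caller's array now holds centered data
```

This is exactly the difference between copy=True (work on a fresh array like Y) and copy=False (center the caller's array in place).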

A copy is sometimes made even with copy=False

Why didn't copy make a difference in your example? The only thing the PCA.fit method does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has data type float32 or float64. If the data type isn't one of those, a type conversion happens, and that creates a copy anyway (in your example, the data is converted from int to float). This is why, in my example above, I made X a float array.
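A rough sketch of this input handling (pca_preprocess is a hypothetical helper for illustration, not the real check_array API) shows why int input is never mutated while float input with copy=False is:

```python
import numpy as np

def pca_preprocess(X, copy):
    """Hypothetical sketch of PCA's input handling: convert to float,
    then center; the copy flag only matters when no conversion occurs."""
    converted = np.asarray(X, dtype=np.float64)  # copies iff X wasn't already float64
    if copy and converted is X:
        converted = converted.copy()
    converted -= converted.mean(axis=0)          # in-place centering
    return converted

X_int = np.arange(20).reshape(5, 4)              # int: conversion copies anyway
pca_preprocess(X_int, copy=False)
print(X_int[0])                                  # [0 1 2 3] -- unchanged

X_float = np.arange(20, dtype=np.float64).reshape(5, 4)
pca_preprocess(X_float, copy=False)
print(X_float[0])                                # [-8. -8. -8. -8.] -- mutated in place
```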

Tiny differences between fit().transform() and fit_transform()

You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter. The difference is within the error of double-precision arithmetic, and it comes from the following:

  • fit performs the SVD X = U @ S @ V.T (where @ denotes matrix multiplication) and stores V in the components_ attribute.
  • transform multiplies the data by V.
  • fit_transform performs the same SVD as fit and then returns U @ S.

Mathematically, U @ S is the same as X @ V because V is an orthogonal matrix. But the errors of floating-point arithmetic result in tiny differences.
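This can be checked directly with NumPy's SVD (a sketch using random centered data rather than scikit-learn itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
X -= X.mean(axis=0)                  # center the data, as PCA does

U, S, Vt = np.linalg.svd(X, full_matrices=False)

us = U * S          # what fit_transform returns: U @ diag(S)
xv = X @ Vt.T       # what fit(X).transform(X) computes: X @ V

print(np.abs(us - xv).max())         # tiny, on the order of machine epsilon
```

The two arrays agree up to floating-point rounding, which is exactly the size of discrepancy in your output.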

It makes sense that fit_transform does U @ S instead of X @ V: it's a simpler and more accurate multiplication to perform, because S is diagonal. The reason fit does not do the same is that only V is stored and, in any case, transform can't know that the argument it receives is the same data the model was fitted on.

Upvotes: 4

Related Questions