I have a follow-up question on: How to normalize with PCA and scikit-learn.
I'm creating an emotion detection system, and in my current pipeline I normalize at two points: step 2) normalize all combined data, and step 4) normalize the subsets.
I was wondering whether normalizing over all the data and normalizing over a subset give the same result. When I tried to simplify my example at the suggestion of @BartoszKP, I realized that my understanding of how the normalization worked was wrong. It behaves the same way in both cases, so this is a valid way to do it, right? (see code)
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA  # RandomizedPCA is deprecated; use PCA with svd_solver='randomized'
import numpy as np

data_1 = np.array([[52, 254], [4, 128]], dtype='f')
data_2 = np.array([[39, 213], [123, 7]], dtype='f')
data_combined = np.vstack((data_1, data_2))
print(data_combined)
"""
Output
[[  52. 254.]
 [   4. 128.]
 [  39. 213.]
 [ 123.   7.]]
"""
#Normalize all data
data_norm = normalize(data_combined)
print(data_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]
 [ 0.18010448  0.98364753]
 [ 0.99838448  0.05681863]]
"""
# n_components cannot exceed the number of features (2 here)
pca = PCA(n_components=2, svd_solver='randomized', whiten=True)
pca.fit(data_norm)
#Normalize subset of data
data_1_norm = normalize(data_1)
print(data_1_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]]
"""
pca.transform(data_1_norm)
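The equivalence can be checked numerically. A minimal sketch (assuming NumPy and scikit-learn are installed): because normalize works row by row, the normalized subset should match the corresponding rows of the normalized combined matrix exactly.

```python
import numpy as np
from sklearn.preprocessing import normalize

data_1 = np.array([[52, 254], [4, 128]], dtype='f')
data_2 = np.array([[39, 213], [123, 7]], dtype='f')
combined = np.vstack((data_1, data_2))

# Row-wise normalization does not look at the other rows, so
# normalizing the subset alone yields the same values as the
# first two rows of the normalized combined matrix.
assert np.allclose(normalize(combined)[:2], normalize(data_1))
```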
Yes. As explained in the documentation, what normalize
does is scale individual samples independently of the others:
Normalization is the process of scaling individual samples to have unit norm.
This is explained further in the documentation of the Normalizer
class:
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
(emphasis mine)
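To make the quoted behavior concrete, here is a small sketch (assuming NumPy and scikit-learn) that reproduces normalize by hand: each row is divided by its own l2 norm, with no other row involved, which is why subset and full-matrix normalization agree.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[52.0, 254.0], [4.0, 128.0]])

# Divide each row by its own l2 norm; no other row is involved.
manual = X / np.linalg.norm(X, axis=1, keepdims=True)

assert np.allclose(manual, normalize(X))  # sklearn's default norm is 'l2'
assert np.allclose(np.linalg.norm(manual, axis=1), 1.0)  # every row has unit norm
```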