Question about applying PCA with one Component

Question

I've been assigned to apply PCA to a data set, retain one component, and then visualize the distribution in a scatter plot that indicates the class of each data point.

For context: the data we're working with has three columns. X is columns 1 and 2, and y is column 3, which contains the class of each data point.

It was implied that the resulting visualization should be a horizontal line, but I'm not seeing that. The resulting visualization is a scatter plot that looks like a positive linear distribution.

import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


df = pd.read_csv("data.csv", header=None)
X = df.iloc[:, 0:2].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=np.random)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

pcaObj1 = PCA(n_components=1)
X_train_PCA = pcaObj1.fit_transform(X_train)
X_test_PCA = pcaObj1.transform(X_test)
X_set, y_set = X_test_PCA, y_test
X3 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01))
X3 = np.array(X3)

plt.xlim(X3.min(), X3.max())
plt.ylim(X3.min(), X3.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 0],
                c = ListedColormap(('purple', 'yellow'))(i), label = j)

Galen · Accepted Answer

I see that you have a test set in addition to a training set; however, this is not the usual setup for PCA. PCA has multiple applications, but one of the main ones is dimensionality reduction. Dimensionality reduction is about removing variables, and PCA serves this purpose by changing the basis of your data and ordering the new basis vectors (the principal components) by the amount (or relative amount) of the total variance that each one linearly explains. Since this requires neither class labels nor a test set, we can think of it as unsupervised machine learning, although many would prefer to call it feature engineering, as it is often used to preprocess data to improve the performance of models trained on that preprocessed data.
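To make the ordering by explained variance concrete, here is a minimal sketch. The data is synthetic and purely an assumption for illustration (three features, two of them strongly correlated); it is not the data from the question. `explained_variance_ratio_` is the scikit-learn attribute that reports each component's share of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: three features, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

# Keep all three components so the shares cover the full variance.
pca = PCA(n_components=3).fit(X)

# Share of the total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_
print(ratios)
```

Because two of the three features move together, most of the variance should land on the first component, which is what makes dropping the later components a reasonable way to reduce dimensionality.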

Let me generate a random dataset with 10 variables and 1000 entries for the sake of example. When you fit PCA with one component, you're constructing a new variable (feature) that is the linear combination of the original variables that linearly explains the most variance in the data. As you say, it is a number line; for a quick-and-easy plot, let's use the index into the new variable's array as the x-axis and the value of the variable as the y-axis.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)

pcaObj1 = PCA(n_components=1)
X_PCA = pcaObj1.fit_transform(X_train)

plt.scatter(range(len(y_labels)), X_PCA, c=['red' if i==0 else 'green' for i in y_labels])
plt.show()

You can see this produces a 1000 x 1 array representing your new variable.

>>> X_PCA.shape
(1000, 1)
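As a rough check that this single column really is a linear combination of the original variables, the sketch below rebuilds it by hand from the fitted `mean_` and `components_` attributes. The random data and the fixed seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data, same shape as in the example above.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

# components_ holds one weight per original variable for each component.
weights = pca.components_

# Center the data with the mean learned during fit, then apply the weights.
manual = (X - pca.mean_) @ weights.T

print(weights.shape, X_pca.shape)
print(np.allclose(manual, X_pca))
```

The weights in `components_` are what you would inspect to see how strongly each original variable contributes to the new one.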

If you had selected n_components=2 instead, you'd have a 1000 x 2 array with two such variables. Let's see that as an example. This time I'll plot the two principal components against each other instead of plotting a single principal component against its index.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)

pcaObj1 = PCA(n_components=2)
X_PCA = pcaObj1.fit_transform(X_train)

plt.scatter(X_PCA[:,0], X_PCA[:,1], c=['red' if i==0 else 'green' for i in y_labels])
plt.show()

Now, my randomly generated data may not have the same properties as your data set. If you expect the plot of a single component to be a line, then in my example it certainly is not: the values form a very erratic trace. You'll see even in the 2D case that the data doesn't seem structured by class, but that's what you would expect from random data.
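One likely reason for the diagonal in the question's plot: each call to plt.scatter there passes X_set[y_set == j, 0] as both the x and the y coordinate, so every point lies on the line y = x. A minimal sketch of the horizontal-line version follows, assuming random stand-in data and a hypothetical helper named `plot_component`; it places the single component on the x-axis and fixes y at zero.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-ins (assumptions) for the two feature columns and the class column.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = np.array([0] * 100 + [1] * 100)

# With n_components=1 the result has a single column; flatten it to 1-D.
pc1 = PCA(n_components=1).fit_transform(X)[:, 0]


def plot_component(values, labels):
    """Scatter one component along x with y fixed at 0, colored by class."""
    fig, ax = plt.subplots()
    for cls, color in zip(np.unique(labels), ("purple", "yellow")):
        mask = labels == cls
        # y is constant, so all points fall on a horizontal line.
        ax.scatter(values[mask], np.zeros(mask.sum()), c=color, label=cls)
    ax.set_yticks([])
    ax.legend()
    return fig
```

Calling plot_component(pc1, y) followed by plt.show() draws the points on one horizontal line, with color indicating the class. Whether the classes separate along that line depends on the data; with random data like this they should not.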
