pow

Reputation: 435

Plotting PCA output in a scatter plot whilst colouring according to label (Python, matplotlib)

I have just completed a PCA analysis of 14 variables which I have chosen to condense into 2 components.

pca = PCA(n_components=2)
a = pca.fit_transform(z)  # fits and transforms in one step; a separate pca.fit(z) is redundant

The output this gives is in form:

[[ -3.84514275e+00  -1.19829226e-01]
 [ -4.78476227e+00  -1.35986090e-01]
 [ -2.26702900e+00  -1.19665853e+00]
 [ -5.01021616e+00   2.76005130e+00]
 [ -5.57580326e+00  -2.00656680e+00]
 [ -5.08184415e+00  -3.68721491e+00]
 [ -3.41505366e+00  -7.61184868e-01]
 [ -4.92439159e+00  -1.82147509e+00]
...
 [ -3.34931300e+00   7.57884906e-01]]

I want to do the following:

  1. plot each observation on a scattergraph with PC1 (x) being the first value in each array and PC2 (y) being the 2nd value.

  2. colour each observation according to the corresponding label type (i.e. A=red, B=blue, C=green, etc) from the initial pre-PCA data.

  3. label SELECTED (not ALL) observations with the name of the observation from the initial pre-PCA data (i.e. John, Peter, Sally, etc.)

any help is greatly appreciated for any/all of these problems.

Worth noting I attempted to do the scatter by:

plt.scatter(a[1], a[2])
plt.show()

but obviously this doesn't work, as my output of a is not separated by commas and this would only plot 2 points. I can't get my head around it, so I would appreciate SO's input.

EDIT:

dataset in form:

John, A, var1, var2, var3, ..., var14
Peter, A, var1, var2, var3, ..., var14
Sally, B, var1, var2, var3, ..., var14
Cath, C, var1, var2, var3, ..., var14
Jim, A, var1, var2, var3, ..., var14

I'm after something similar to this:

[image]

Upvotes: 4

Views: 15961

Answers (1)

WhoIsJack

Reputation: 1498

I think your question is now very clear - thanks for editing!

Here's how the plot you describe can be created.


First, let's generate some example data:

# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Params
n_samples  = 100
m_features =  14
selected_names = ['name_13', 'name_23', 'name_42', 'name_66']

# Generate
np.random.seed(42)
names    = ['name_%i' % i for i in range(n_samples)]
labels   = [np.random.choice(['A','B','C','D']) for i in range(n_samples)]
features = np.random.random((n_samples,m_features))
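
For a real dataset in the form shown in the question (name, label, then the variables on each row), the same three variables could be built roughly as follows. This is only a sketch: it assumes a comma-separated file with no header row, and the inline sample text and its values are made up for illustration (three variables instead of 14, to keep it short).

```python
import csv
import io

# Hypothetical sample in the question's layout: name, label, var1, var2, ...
# (three made-up variables per row instead of 14)
raw = """John, A, 0.1, 0.2, 0.3
Peter, A, 0.4, 0.5, 0.6
Sally, B, 0.7, 0.8, 0.9
"""

names, labels, features = [], [], []
# skipinitialspace=True drops the blank that follows each comma
for row in csv.reader(io.StringIO(raw), skipinitialspace=True):
    names.append(row[0])                          # observation name
    labels.append(row[1])                         # class label
    features.append([float(v) for v in row[2:]])  # numeric variables

print(names)
print(labels)
print(features)
```

For a file on disk, io.StringIO(raw) would be replaced by an open file handle. The resulting features list can be passed to fit_transform as it is, since scikit-learn accepts array-like input.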

Next we do the PCA:

pca = PCA(n_components=2)
features_pca = pca.fit_transform(features)
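
A note on the attempt in the question, plt.scatter(a[1], a[2]): this indexes rows rather than columns, so a[1] is the second observation, not PC1. The output of fit_transform is a 2-D NumPy array with one row per observation and one column per component, so the components are selected with column slices. A small sketch, using the first three rows printed in the question as stand-in data:

```python
import numpy as np

# Stand-in for the PCA output: rows are observations, columns are PC1 and PC2
a = np.array([[-3.84514275, -0.119829226],
              [-4.78476227, -0.135986090],
              [-2.26702900, -1.19665853]])

print(a.shape)   # (n_observations, n_components)
print(a[1])      # one row: both components of the second observation

pc1 = a[:, 0]    # first column: PC1 for every observation
pc2 = a[:, 1]    # second column: PC2 for every observation
```

This is why the plotting code further down uses features_pca[:,0] and features_pca[:,1] as the x and y values.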

Then we prepare a list/array of length n_samples that translates the labels A,B,C,... into colors, one entry per observation. These can either be hand-selected colors...

# Label to color dict (manual)
label_color_dict = {'A':'red','B':'green','C':'blue','D':'magenta'}

# Color vector creation
cvec = [label_color_dict[label] for label in labels]

...or just a range of integers.

# Label to color dict (automatic)
label_color_dict = {label:idx for idx,label in enumerate(np.unique(labels))}

# Color vector creation
cvec = [label_color_dict[label] for label in labels]
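
As a variation on the automatic mapping, np.unique can return the integer codes directly through its return_inverse argument, which avoids the dictionary altogether. A short sketch with a made-up label list:

```python
import numpy as np

labels = ['A', 'B', 'A', 'C']   # made-up labels for illustration

# uniq holds the sorted distinct labels;
# codes[i] is the position of labels[i] within uniq
uniq, codes = np.unique(labels, return_inverse=True)

print(uniq)    # distinct labels, sorted
print(codes)   # one integer per observation, usable as the c argument of plt.scatter
```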

Finally, it's time to plot.

# Create the scatter plot
plt.figure(figsize=(8,8))
plt.scatter(features_pca[:,0], features_pca[:,1],
            c=cvec, edgecolor='none', alpha=0.5)

# Add the labels
for name in selected_names:

    # Get the index of the name
    i = names.index(name)

    # Add the text label
    labelpad = 0.01   # Adjust this based on your dataset
    plt.text(features_pca[i,0]+labelpad, features_pca[i,1]+labelpad, name, fontsize=9)

    # Mark the labeled observations with a star marker
    plt.scatter(features_pca[i,0], features_pca[i,1],
                c=cvec[i], vmin=min(cvec), vmax=max(cvec),
                edgecolor='none', marker='*', s=100)

# Add the axis labels
plt.xlabel('PC 1 (%.2f%%)' % (pca.explained_variance_ratio_[0]*100))
plt.ylabel('PC 2 (%.2f%%)' % (pca.explained_variance_ratio_[1]*100))

# Done
plt.show()

As you can see, the different colors can be fed into plt.scatter via the c kwarg. In addition, I recommend edgecolor='none', as removing the marker outlines often looks cleaner (older code sometimes uses edgecolor='' for this, but recent matplotlib releases do not treat the empty string as a valid color). You can play with alpha to increase/decrease transparency, which will make the labeled points stand out more/less.

The labels are simply placed on the plot using plt.text with the appropriate x and y positions, which I shift slightly (using labelpad) so that each label sits next to its marker rather than on top of it.
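
An alternative to a data-dependent labelpad is annotate with textcoords='offset points', which places the text a fixed number of points away from the marker regardless of the axis scale. A minimal sketch with made-up names and coordinates; the Agg backend and the output filename are assumptions so that the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless run; omit for interactive use
import matplotlib.pyplot as plt

# Made-up points standing in for the features_pca rows of the selected observations
points = {'John': (-3.8, -0.1), 'Sally': (-2.3, -1.2)}

fig, ax = plt.subplots()
annotations = []
for name, (x, y) in points.items():
    ax.scatter(x, y, c='red', marker='*', s=100)
    # xytext is an offset in points from xy, not a data coordinate
    ann = ax.annotate(name, xy=(x, y), xytext=(5, 5),
                      textcoords='offset points', fontsize=9)
    annotations.append(ann)

fig.savefig('annotated_points.png')  # hypothetical output file
```

Because the offset is expressed in points, it does not need re-tuning when the range of the principal components changes.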

For the star marker, note that vmin and vmax are important if you are using a numeric color vector, since otherwise the stars will end up in the wrong color.

Here's the result (using the manually defined colors):

[image]

Upvotes: 7
