I have just completed a PCA analysis of 14 variables which I have chosen to condense into 2 components.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(z)
a = pca.fit_transform(z)
The output this gives is of the form:
[[ -3.84514275e+00 -1.19829226e-01]
[ -4.78476227e+00 -1.35986090e-01]
[ -2.26702900e+00 -1.19665853e+00]
[ -5.01021616e+00 2.76005130e+00]
[ -5.57580326e+00 -2.00656680e+00]
[ -5.08184415e+00 -3.68721491e+00]
[ -3.41505366e+00 -7.61184868e-01]
[ -4.92439159e+00 -1.82147509e+00]
...
[ -3.34931300e+00 7.57884906e-01]]
I want to do the following:
plot each observation on a scattergraph with PC1 (x) being the first value in each array and PC2 (y) being the 2nd value.
colour each observation according to the corresponding label type (i.e. A=red, B=blue, C=green, etc) from the initial pre-PCA data.
label SELECTED (not ALL) observations with the name of the observation from the initial pre-PCA data (i.e. John, Peter, Sally, etc.)
Any help with any or all of these problems is greatly appreciated.
Worth noting I attempted to do the scatter by:
plt.scatter(a[1], a[2])
plt.show()
but obviously this doesn't work as my output of a is not separated by commas and would only plot 2 points. Can't get my head around it, so I would appreciate SO's input.
EDIT:
The dataset is in the form:
John, A, var1, var2, var3, ..., var14
Peter, A, var1, var2, var3, ..., var14
Sally, B, var1, var2, var3, ..., var14
Cath, C, var1, var2, var3, ..., var14
Jim, A, var1, var2, var3, ..., var14
I'm after something similar to this:
I think your question is now very clear - thanks for editing!
Here's how the plot you describe can be created.
First, let's generate some example data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Params
n_samples = 100
m_features = 14
selected_names = ['name_13', 'name_23', 'name_42', 'name_66']

# Generate random example data
np.random.seed(42)
names = ['name_%i' % i for i in range(n_samples)]
labels = [np.random.choice(['A','B','C','D']) for i in range(n_samples)]
features = np.random.random((n_samples,m_features))
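If your data is already in the comma-separated form shown in the question (name, label, var1, ..., var14), you could instead build names, labels and features from your own file. Here is a minimal sketch, assuming a headerless CSV called data.csv (the file name and the use of pandas are my assumptions):

import pandas as pd

# Load rows shaped like: John, A, var1, var2, ..., var14 (hypothetical data.csv)
df = pd.read_csv('data.csv', header=None, skipinitialspace=True)

names = df.iloc[:, 0].tolist()      # first column: observation names
labels = df.iloc[:, 1].tolist()     # second column: label type (A, B, C, ...)
features = df.iloc[:, 2:].values    # remaining 14 columns: the variables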
Next we do the PCA:
pca = PCA(n_components=2)
features_pca = pca.fit_transform(features)
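Before plotting, it can be worth checking how much of the total variance the two components capture:

# Fraction of the variance explained by each of the two components
print(pca.explained_variance_ratio_)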
Then we prepare a list/array of length n_samples that translates the labels A, B, C, ... into colors. These can either be hand-selected colors...
# Label to color dict (manual)
label_color_dict = {'A':'red','B':'green','C':'blue','D':'magenta'}
# Color vector creation
cvec = [label_color_dict[label] for label in labels]
...or just a range of integers.
# Label to color dict (automatic)
label_color_dict = {label:idx for idx,label in enumerate(np.unique(labels))}
# Color vector creation
cvec = [label_color_dict[label] for label in labels]
Finally, it's time to plot.
# Create the scatter plot
plt.figure(figsize=(8,8))
plt.scatter(features_pca[:,0], features_pca[:,1],
            c=cvec, edgecolors='none', alpha=0.5)

# Add the labels
for name in selected_names:
    # Get the index of the name
    i = names.index(name)

    # Add the text label
    labelpad = 0.01  # Adjust this based on your dataset
    plt.text(features_pca[i,0]+labelpad, features_pca[i,1]+labelpad,
             name, fontsize=9)

    # Mark the labeled observations with a star marker
    plt.scatter(features_pca[i,0], features_pca[i,1],
                c=cvec[i], vmin=min(cvec), vmax=max(cvec),
                edgecolors='none', marker='*', s=100)

# Add the axis labels
plt.xlabel('PC 1 (%.2f%%)' % (pca.explained_variance_ratio_[0]*100))
plt.ylabel('PC 2 (%.2f%%)' % (pca.explained_variance_ratio_[1]*100))

# Done
plt.show()
As you can see, the different colors can be fed into plt.scatter via the c keyword argument. In addition, I recommend edgecolors='none', as this often looks cleaner. You can play with alpha to increase or decrease transparency, which will make the labeled points stand out more or less.
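If you also want a legend showing which color belongs to which label, one option (a sketch that simply reuses the manual label_color_dict defined above) is to build the legend handles yourself:

from matplotlib.patches import Patch

# One legend entry per label, reusing the manual label-to-color mapping
legend_handles = [Patch(color=color, label=label)
                  for label, color in label_color_dict.items()]
plt.legend(handles=legend_handles, title='Label')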
The labels are simply placed on the plot using plt.text with the appropriate x and y positions, which I here adjust a little bit (using labelpad) so that the labels sit nicely next to the marker.
For the star markers, note that vmin and vmax are important if you are using a numeric color vector: each call to plt.scatter normalizes its color values independently, so without fixed limits the stars would end up with the wrong colors.
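A minimal illustration of that point (a toy example of my own, separate from the plot above):

# A lone point with a numeric color is normalized against itself,
# so its color ignores the 0..3 scale used elsewhere unless vmin/vmax pin it.
plt.scatter([0], [0], c=[2])                    # own (degenerate) color scale
plt.scatter([1], [0], c=[2], vmin=0, vmax=3)    # pinned to the shared 0..3 scale
plt.show()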
Here's the result (using the manually defined colors):