Sheldon

Reputation: 4633

SVM: plot decision surface when working with more than 2 features

I am working with scikit-learn's breast cancer dataset, which has 30 features. Following this tutorial for the much less depressing iris dataset, I figured out how to plot the decision surface separating the "benign" and "malignant" categories when considering only the dataset's first two features (mean radius and mean texture).

This is what I get:

[Plot: decision surface separating "benign" and "malignant" over mean radius and mean texture]

But how can I represent the hyperplane computed when using all features in the dataset? I am aware that I cannot plot a graph in 30 dimensions, but I would like to "project" the hyperplane created when running svm.SVC(kernel='linear', C=1).fit(X_train, y_train) onto the 2D scatter plot of mean texture against mean radius.

I read about using PCA to reduce dimensionality, but I suspect that fitting on a "reduced" dataset is not the same as projecting onto 2D the hyperplane computed over all 30 features (see the sketch after my code below).


Here is my code so far:

from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm
import numpy as np

#Load dataset
cancer = datasets.load_breast_cancer()

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109) # 70% training and 30% test

h = .02  # mesh step    
C = 1.0  # Regularisation
clf = svm.SVC(kernel='linear', C=C).fit(X_train[:,:2], y_train) # Linear Kernel


x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))


Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
scat=plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train) 
legend1 = plt.legend(*scat.legend_elements(),
                    loc="upper right", title="diagnostic")
plt.xlabel('mean_radius')
plt.ylabel('mean_texture')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
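For concreteness, the PCA route I read about would look something like this (just a sketch of my understanding; the SVC here is fit on the 2 principal components rather than on all 30 features, which is why I suspect it does not answer my question):

from sklearn.decomposition import PCA

# Project the 30 features onto the first two principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)

# This SVM is fit in PCA space, so its decision boundary lives there too;
# it is not the 30-feature hyperplane projected onto radius/texture
clf_pca = svm.SVC(kernel='linear', C=1).fit(X_train_pca, y_train)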

Upvotes: 2

Views: 7537

Answers (2)

seralouk

Reputation: 33137

You cannot visualize the decision surface for a large number of features: with N features the surface is an object in N-dimensional space, and there is no way to plot it directly.

I have also written an article about this here: https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35

However, you can use 2 features and plot nice decision surfaces as follows.

Case 1: 2D plot for 2 features, using the iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

model = svm.SVC(kernel='linear')
clf = model.fit(X, y)

fig, ax = plt.subplots()
# title for the plots
title = 'Decision surface of linear SVC'
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
scat = ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend(*scat.legend_elements(), title="classes")
plt.show()

[Plot: decision surface of the linear SVC on the first two iris features]

Case 2: 3D plot for 3 features, using the iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

iris = datasets.load_iris()
X = iris.data[:, :3]  # we only take the first three features.
Y = iris.target

# make it a binary classification problem: keep only classes 0 and 1
X = X[np.logical_or(Y==0, Y==1)]
Y = Y[np.logical_or(Y==0, Y==1)]

model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)

# The separating plane is the set of points x with
# np.dot(clf.coef_[0], x) + clf.intercept_[0] = 0. Solve for the third coordinate (z):
z = lambda x,y: (-clf.intercept_[0] - clf.coef_[0][0]*x - clf.coef_[0][1]*y) / clf.coef_[0][2]

tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)

fig = plt.figure()
ax  = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2], 'ob')  # class 0: blue circles
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2], 'sr')  # class 1: red squares
ax.plot_surface(x, y, z(x,y))                     # the separating plane
ax.view_init(30, 60)
plt.show()

[Plot: 3D scatter of the two classes with the separating plane]

Upvotes: 2

Zabir Al Nazi Nabil

Reputation: 11198

You can't plot the 30-dimensional data without first transforming it to 2-D. One project built around exactly this kind of high-dimensional decision-boundary plotting:

https://github.com/tmadl/highdimensional-decision-boundary-plot

What is a Voronoi Tessellation? Given a set P := {p1, ..., pn} of sites, a Voronoi Tessellation is a subdivision of the space into n cells, one for each site in P, with the property that a point q lies in the cell corresponding to a site pi iff d(pi, q) < d(pj, q) for i distinct from j. The segments in a Voronoi Tessellation correspond to all points in the plane equidistant to the two nearest sites. Voronoi Tessellations have applications in computer science. - https://philogb.github.io/blog/2010/02/12/voronoi-tessellation/

In geometry, a centroidal Voronoi tessellation (CVT) is a special type of Voronoi tessellation or Voronoi diagram. A Voronoi tessellation is called centroidal when the generating point of each Voronoi cell is also its centroid, i.e. the arithmetic mean or center of mass. It can be viewed as an optimal partition corresponding to an optimal distribution of generators. A number of algorithms can be used to generate centroidal Voronoi tessellations, including Lloyd's algorithm for K-means clustering or Quasi-Newton methods like BFGS. - Wiki
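As an aside, scikit-learn's KMeans uses Lloyd-style iterations (its default algorithm), so the generators of an approximately centroidal Voronoi tessellation can be obtained like this (illustrative only, on random toy data; not needed for the plot below):

import numpy as np
from sklearn.cluster import KMeans

# Lloyd's algorithm: alternately assign points to the nearest center
# and move each center to the mean (centroid) of its cell
rng = np.random.RandomState(0)
points = rng.rand(500, 2)  # toy 2-D point cloud
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(points)

# cluster_centers_ generate an approximately centroidal Voronoi
# tessellation of the cloud; labels_ gives each point's cell
print(kmeans.cluster_centers_.shape)  # (10, 2)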

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn import svm


bcd = load_breast_cancer()
X,y = bcd.data, bcd.target
X_Train_embedded = TSNE(n_components=2).fit_transform(X)
print(X_Train_embedded.shape)

h = .02  # mesh step    
C = 1.0  # Regularisation

clf = svm.SVC(kernel='linear', C=C)  # linear kernel
clf = clf.fit(X, y)                  # fit on all 30 original features
y_predicted = clf.predict(X)         # these predictions color the 2-D background below


resolution = 100 # 100x100 background pixels
X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:,0]), np.max(X_Train_embedded[:,0])
X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:,1]), np.max(X_Train_embedded[:,1])
xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))

# approximate Voronoi tessellation on a resolution x resolution grid using 1-NN
background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted) 
voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
voronoiBackground = voronoiBackground.reshape((resolution, resolution))

#plot
plt.contourf(xx, yy, voronoiBackground)
plt.scatter(X_Train_embedded[:,0], X_Train_embedded[:,1], c=y)
plt.show()

[Plot: t-SNE embedding with 1-NN Voronoi background colored by predicted class]

Upvotes: 1
