BenB

Reputation: 1050

Gaussian mixture model (GMM) gives a bad fit

I've been playing with scikit-learn's GMM class. To start with, I've just created a distribution along the line x = y.

from sklearn import mixture
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

#Create a distribution that's centred along y=x
line_model.fit(zip(xs,ys))
plt.plot(xs, ys)
plt.show()

This produces the expected distribution:

[plot: the distribution of training points along y = x]

Next I fit a GMM to it, and plot the results:

#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x,y])

#Calculate a probability for each point in the x,y grid.
x_y_z_grid = []
for x,y in x_y_grid:
    z = line_model.score([[x,y]])
    x_y_z_grid.append([x,y,z])

x_y_z_grid = np.array(x_y_z_grid)

#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], np.exp(x_y_z_grid[:,2]))
plt.show()

The resulting probability distribution has some weird tails along x = 0 and x = 1, and also extra probability in the corners (x = 1, y = 1 and x = 0, y = 0).

[plot: probability distribution, n_components = 99]

Using n_components=5 also shows this behaviour:

[plot: probability distribution, n_components = 5]

Is this something inherent with GMMs, or is there an issue with the implementation, or am I doing something wrong?

Edit: evaluating the scores from the model over a different domain seems to get rid of this behaviour -- should it?

I'm training both models on the same dataset (the line y = x from x = 0 to x = 1). Simply checking the probability via the GMM's score method seems to eliminate this boundary effect. Why is this? I've attached the plots and code below.

Checking the scores over different domains affects the distribution.

# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1

# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2


shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)

x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)

#Train both gmms on a distribution that's centered along y=x
shorter_eval_range_gmm.fit(zip(x_small,y_small))
longer_eval_range_gmm.fit(zip(x_small,y_small))


#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x,y])
x_y_evals_grid_small = []

for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x,y])

#Calculate a probability for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x,y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)

x_y_z_plot_grid_small = []
for x,y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)


#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")

ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')

plt.show()

Upvotes: 2

Views: 3027

Answers (2)

Kyle Kastner

Reputation: 1018

EDIT: This is not correct. After talking with Ronald P., I realised you can't get Gibbs effects here, because the Gaussians cannot compensate for each other by "going negative": probability is strictly > 0. This turns out to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs, rather than a 1D line.

The GMM is fitting to the data you gave it - specifically:

xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

Because the data ends at 0 and 1, the GMM is attempting to model that fact: -0.01 and 1.01 are technically outside the trained data range and should be scored with very low probabilities. In doing so, it ends up creating Gaussians with smaller spread (smaller covariance / higher precision) to cover the ends of the data and model the fact that the data stops there.

I would expect that adding enough Gaussians would lead to a pseudo-Gibbs phenomenon, and you can kind of see that happening in the change from 5 to 99 components. To model the edges exactly, you would need an infinite mixture model. This is analogous to infinite frequency components: in a GMM you are also representing a "signal" with a set of basis functions (in this case, Gaussians)!
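
As a rough way to see this edge effect, here is a minimal sketch (written against the current scikit-learn API -- GaussianMixture and score_samples -- rather than the old mixture.GMM used in the question) that fits the same line with an increasing number of components and compares the score at the edge of the data with the score at its centre:

from sklearn.mixture import GaussianMixture
import numpy as np

# Same 1D line of "observations" along y = x as in the question.
line = np.column_stack([np.linspace(0, 1, 100), np.linspace(0, 1, 100)])

for k in (1, 5, 20, 99):
    gmm = GaussianMixture(n_components=k).fit(line)
    # score_samples returns the per-sample log-likelihood.
    edge, centre = gmm.score_samples([[0.0, 0.0], [0.5, 0.5]])
    print("k=%3d  log p(edge)=%8.2f  log p(centre)=%8.2f" % (k, edge, centre))

The printed gap between the edge and centre scores is where any boundary behaviour would show up as the number of components grows.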

Upvotes: 1

Ronald P

Reputation: 126

There is no problem with the fit, only with the visualisation you're using. A hint is the straight line connecting (0, 1, 5) to (0, 1, 0): it is just the rendering of the segment between two consecutively plotted points (an artefact of the order in which the points are read). Although the two points at its extremes are in your data, no other point on this line actually is.

Personally, I think it is a rather bad idea to use 3D line plots (wires) to represent a surface, for the reason mentioned above, and I would recommend surface plots or contour plots instead.

Try this:

from sklearn import mixture
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T

#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()

#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]

#Calculate a probability for each point in the x,y grid.
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)

#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
plt.show()
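
If a flat view is easier to read than the 3D surface, a contour plot of the same grid works as a drop-in alternative. This is only a sketch and reuses the X, Y and z arrays from the snippet above:

# Contour view of the same scores; reuses X, Y and z from the snippet above.
fig, ax = plt.subplots()
contours = ax.contourf(X, Y, np.exp(z), 20)
cbar = fig.colorbar(contours, ax=ax)
cbar.set_label('likelihood')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()

It shows the same information without any misleading connecting segments.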

From an academic point of view, I am quite uncomfortable with the goal of fitting a 1D line in a 2D space with a 2D mixture model. Manifold learning with GMMs requires at least the normal direction to have zero variance, thus reducing it to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilising trick in the GMM fit, since the variance of the model is rather large in the direction normal to the straight line).
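
One way to check this numerically is sketched below, again against the current scikit-learn API: GaussianMixture exposes the fitted covariance matrices as covariances_, and its reg_covar regularisation appears to play the role of the stabilising trick mentioned above (the old mixture.GMM had a min_covar parameter instead). The sketch fits the line and prints how much variance each component keeps along the direction normal to y = x:

from sklearn.mixture import GaussianMixture
import numpy as np

# Fit the same line and inspect the variance of each component in the
# direction normal to y = x (expected to be small but bounded away from zero).
line = np.column_stack([np.linspace(0, 1, 100), np.linspace(0, 1, 100)])
gmm = GaussianMixture(n_components=5, covariance_type='full').fit(line)

normal = np.array([1.0, -1.0]) / np.sqrt(2)   # unit normal to the line y = x
for k, cov in enumerate(gmm.covariances_):
    var_normal = normal.dot(cov).dot(normal)  # variance along the normal
    print("component %d: variance normal to the line = %.2e" % (k, var_normal))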

It is also recommended to use plt.scatter rather than plt.plot when drawing data, since there is no reason to connect the dots when you're fitting their joint distribution.

Hope this helps to shed some light on your problem.

Upvotes: 5
