itzik Ben Shabat

Reputation: 927

How To Increase Sklearn GMM predict() Performance Speed?

I am using Sklearn to estimate the Gaussian Mixture Model (GMM) on some data.

After the estimation, I have many query points. I would like to obtain their probabilities of belonging to each of the estimated Gaussians.

The code below works. However, the gmm_sk.predict_proba(query_points) part is very slow as I need to run it multiple times on 100000 sets of samples, where each sample contains 1000 points.

I guess this happens because the prediction runs sequentially. Is there a way to make it parallel, or any other way to make it faster? Maybe on the GPU using TensorFlow?

I saw that TensorFlow has its own GMM algorithm, but it was very hard to implement.

Here is the code that I have written:

import numpy as np
from sklearn.mixture import GaussianMixture
import time


n_gaussians = 1000
covariance_type = 'diag'
points = np.array(np.random.rand(10000, 3), dtype=np.float32)
query_points = np.array(np.random.rand(1000, 3), dtype=np.float32)
start = time.time()

# GMM with sklearn
gmm_sk = GaussianMixture(n_components=n_gaussians, covariance_type=covariance_type)
gmm_sk.fit(points)
mid_t = time.time()
elapsed = time.time() - start
print("learning took " + str(elapsed))

# score the same query set many times (this is the slow part)
temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))

end_t = time.time() - mid_t
print("predictions took " + str(end_t))

I solved it using multiprocessing. I just replaced

temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))

with

import multiprocessing as mp

query_points = query_points.tolist()
parallel = mp.Pool()
fv = parallel.map(par_gmm, query_points)
parallel.close()
parallel.join()
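
A minimal sketch of an alternative along the same lines: instead of sending one point at a time to the pool, split the query array into one chunk per core so each worker makes a single vectorized predict_proba call. This assumes the fitted gmm_sk and the original query_points array from above, and a fork-based platform (e.g. Linux) where the workers inherit the fitted model; one chunk per CPU core is an arbitrary choice:

import multiprocessing as mp
import numpy as np

def predict_chunk(chunk):
    # each worker scores a whole block of query points in one vectorized call
    return gmm_sk.predict_proba(chunk)

if __name__ == '__main__':
    # one chunk per core; the workers inherit the fitted gmm_sk via fork
    chunks = np.array_split(query_points, mp.cpu_count())
    with mp.Pool() as pool:
        parts = pool.map(predict_chunk, chunks)
    probs = np.vstack(parts)  # shape: (n_query_points, n_gaussians)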

Upvotes: 3

Views: 3535

Answers (2)

Ufuk Can Bicici

Reputation: 3649

I see that the number of Gaussian components in your GMM is 1000, which I think is a very large number given that your data dimensionality is relatively low (3). This is probably the reason it runs slowly, since it needs to evaluate 1000 separate Gaussians. If your sample count is low, this is also very prone to overfitting. You can try a lower number of components, which will naturally be faster and will most likely generalize better.
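
To illustrate the trade-off, a rough sketch that fits the same kind of synthetic data with a few hypothetical component counts and compares prediction time and BIC (the counts 50/200/1000 are arbitrary choices, not recommendations):

import time
import numpy as np
from sklearn.mixture import GaussianMixture

# same shapes as the synthetic data in the question
points = np.random.rand(10000, 3).astype(np.float32)
query_points = np.random.rand(1000, 3).astype(np.float32)

for k in (50, 200, 1000):  # hypothetical component counts to compare
    gmm = GaussianMixture(n_components=k, covariance_type='diag').fit(points)
    t0 = time.time()
    gmm.predict_proba(query_points)
    print(k, "components:", round(time.time() - t0, 4), "s per call,",
          "BIC:", round(gmm.bic(points), 1))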

Upvotes: 0

seralouk

Reputation: 33147

You could speed up the process if you fit with a 'diag' or 'spherical' covariance matrix instead of 'full'.

Use:

covariance_type='diag'

or

covariance_type='spherical'

inside GaussianMixture.

Also, try to decrease the number of Gaussian components.

However, keep in mind that this may affect the results, but I cannot see any other way to speed up the process.
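
As a rough illustration of the covariance_type trade-off, the sketch below times one predict_proba call for 'full', 'diag' and 'spherical' on synthetic data shaped like the question's; the 100 components used here are an arbitrary choice and the actual numbers will depend on your machine:

import time
import numpy as np
from sklearn.mixture import GaussianMixture

# same shapes as the synthetic data in the question
points = np.random.rand(10000, 3).astype(np.float32)
query_points = np.random.rand(1000, 3).astype(np.float32)

for cov in ('full', 'diag', 'spherical'):
    gmm = GaussianMixture(n_components=100, covariance_type=cov).fit(points)
    t0 = time.time()
    gmm.predict_proba(query_points)
    print(cov, "->", round(time.time() - t0, 4), "s per predict_proba call")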

Upvotes: 0
