SuperKogito

Reputation: 2966

How to solve MemoryError in sklearn when fitting huge data to a GMM?

I am trying to generate a Universal Background Model (UBM) based on a huge array of extracted MFCC features, but I keep getting a MemoryError when fitting my data to the model. Here is the relevant code section:

from sklearn.mixture import GMM

files_features.shape
(2469082, 56)

gmm = GMM(n_components=1024, n_iter=512, covariance_type='full', n_init=3)
gmm.fit(files_features)

Is there a way to solve this error, or to decompose the processing of the data so that the memory error is avoided? I am quite new to this field and would appreciate any help.

[Update]

Unfortunately, the answers mentioned here do not solve my issue, since they assume the dataset has low variance, whereas in my case:

round(np.var(files_features), 3)
47.781

Incremental fitting might be a solution, but scikit-learn does not provide a partial_fit for GMMs. I would appreciate any suggestions on how to tackle this, whether alternative library recommendations, partial_fit reference implementations, or processing the data batch by batch (which does not work out of the box, because GMM.fit() has no memory of previous calls).
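For reference, my rough estimate of where the memory goes: the E-step has to hold a responsibility matrix of shape (n_samples, n_components), which for 2,469,082 frames and 1024 components is already around 20 GB of float64 values. Below is a minimal sketch of one fallback I am considering, assuming a random subsample of frames is acceptable for UBM training; it uses GaussianMixture (the current scikit-learn replacement for the deprecated GMM class), and the subsample size of 200,000 is only an illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

# files_features: the (2469082, 56) array of MFCC frames from above.
n_samples = files_features.shape[0]
rng = np.random.default_rng(0)

# Keep a random subset of frames so the (n_samples x n_components)
# responsibility matrix fits in RAM; 200_000 is an arbitrary choice.
idx = rng.choice(n_samples, size=200_000, replace=False)
subset = files_features[idx]

gmm = GaussianMixture(n_components=1024, max_iter=512,
                      covariance_type='full', n_init=3)
gmm.fit(subset)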

Upvotes: 1

Views: 2265

Answers (2)

SuperKogito

Reputation: 2966

For those who have the same issue, I recommend the Bob library, which supports big-data processing and even offers parallel processing.

In my use case, Bob was a great fit for developing GMM-UBM systems, as all the relevant functionality is already implemented.
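For orientation, here is a minimal sketch of what UBM training looks like there. It assumes the newer bob.learn.em releases, where GMMMachine exposes a scikit-learn-style fit interface; the exact class and argument names should be checked against the Bob documentation for your version.

from bob.learn.em import GMMMachine  # assumes the scikit-learn-style bob.learn.em API

# files_features: the (2469082, 56) MFCC array from the question.
# Train a maximum-likelihood GMM to serve as the UBM; iteration count and
# other options are configurable, see the bob.learn.em documentation.
ubm = GMMMachine(n_gaussians=1024)
ubm.fit(files_features)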

Upvotes: 1

Qusai Alothman

Reputation: 2072

That's fairly straightforward using Dask.
Just use Dask's DataFrame instead of pandas', and everything else should work without any changes.
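A minimal sketch of that substitution (the file pattern below is hypothetical; note that scikit-learn's GMM itself still expects an in-memory NumPy array, so the out-of-core benefit mainly applies to loading and preprocessing the features):

import dask.dataframe as dd

# Hypothetical layout: MFCC frames stored as many CSV shards on disk.
features = dd.read_csv("mfcc_features/*.csv")

# Out-of-core preprocessing with the familiar pandas API (lazy until computed).
features = features.dropna()

# scikit-learn still needs a concrete array at fit time, so the (possibly
# reduced) frames must eventually be materialised:
X = features.to_dask_array(lengths=True).compute()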

As an alternative to scikit-learn, you can use Turi's GraphLab Create, which can handle arbitrarily large datasets (though I'm not sure it supports GMMs).

Upvotes: 2
