SuperKogito

Reputation: 2966

How to solve MemoryError in sklearn when fitting huge data to a GMM?

I am trying to generate a Universal Background Model (UBM) based on a huge array of extracted MFCC features, but I keep getting a MemoryError when fitting my data to the model. Here is the relevant code section:

from sklearn.mixture import GMM

files_features.shape
(2469082, 56)

gmm = GMM(n_components=1024, n_iter=512, covariance_type='full', n_init=3)
gmm.fit(files_features)

Is there a way to solve this error, or to decompose the processing of the data so that the memory error is avoided? I am quite new to this field and would appreciate any help.

[Update]

Unfortunately, the answers mentioned here do not solve my issue, since they assume the dataset has low variance, whereas in my case:

round(np.var(files_features), 3)
47.781

Incremental fitting might be a solution, but scikit-learn does not provide a partial_fit for GMMs. I would appreciate any suggestions on how to tackle this, whether alternative library recommendations, partial_fit reference implementations, or processing the data batch by batch (which does not work out of the box, because GMM.fit() has no memory of previous calls).
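For reference, my rough estimate of where the memory goes: the E-step has to hold a responsibility matrix of shape (n_samples, n_components), which for 2,469,082 frames and 1024 components is already around 20 GB of float64 values. Below is a minimal sketch of one fallback I am considering, assuming a random subsample of frames is acceptable for UBM training; it uses GaussianMixture (the current scikit-learn replacement for the deprecated GMM class), and the subsample size of 200,000 is only an illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

# files_features: the (2469082, 56) array of MFCC frames from above.
n_samples = files_features.shape[0]
rng = np.random.default_rng(0)

# Keep a random subset of frames so the (n_samples x n_components)
# responsibility matrix fits in RAM; 200_000 is an arbitrary choice.
idx = rng.choice(n_samples, size=200_000, replace=False)
subset = files_features[idx]

gmm = GaussianMixture(n_components=1024, max_iter=512,
                      covariance_type='full', n_init=3)
gmm.fit(subset)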

Upvotes: 1

Views: 2265

Answers (2)

SuperKogito

Reputation: 2966

For those who have the same issue, I recommend the Bob library, which supports big-data processing and even offers parallel processing.

In my use case, Bob was a great fit for developing GMM-UBM systems, as all the relevant functionality is already implemented.
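For orientation, here is a minimal sketch of what UBM training looks like there. It assumes the newer bob.learn.em releases, where GMMMachine exposes a scikit-learn-style fit interface; the exact class and argument names should be checked against the Bob documentation for your version.

from bob.learn.em import GMMMachine  # assumes the scikit-learn-style bob.learn.em API

# files_features: the (2469082, 56) MFCC array from the question.
# Train a maximum-likelihood GMM to serve as the UBM; iteration count and
# other options are configurable, see the bob.learn.em documentation.
ubm = GMMMachine(n_gaussians=1024)
ubm.fit(files_features)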

Upvotes: 1

Qusai Alothman

Reputation: 2072

That's fairly straightforward using Dask.
Just use Dask's DataFrame instead of pandas', and everything else should work without any changes.
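A minimal sketch of that substitution (the file pattern below is hypothetical; note that scikit-learn's GMM itself still expects an in-memory NumPy array, so the out-of-core benefit mainly applies to loading and preprocessing the features):

import dask.dataframe as dd

# Hypothetical layout: MFCC frames stored as many CSV shards on disk.
features = dd.read_csv("mfcc_features/*.csv")

# Out-of-core preprocessing with the familiar pandas API (lazy until computed).
features = features.dropna()

# scikit-learn still needs a concrete array at fit time, so the (possibly
# reduced) frames must eventually be materialised:
X = features.to_dask_array(lengths=True).compute()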

As an alternative to scikit-learn, you can use Turi's GraphLab Create, which can handle arbitrarily large datasets (though I'm not sure it supports GMMs).

Upvotes: 2
