Reputation: 2966
I am trying to generate a Universal Background Model (UBM) from a huge array of extracted MFCC features, but I keep getting a MemoryError when I fit the data to the model. Here is the relevant code section:
files_features.shape
(2469082, 56)
from sklearn.mixture import GMM
gmm = GMM(n_components=1024, n_iter=512, covariance_type='full', n_init=3)
gmm.fit(files_features)
Is there a way to solve this error, or to break the processing of the data into smaller pieces to avoid it? I am quite new to this field and would appreciate any help.
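For context, a rough back-of-the-envelope estimate (assuming double precision, and that the E-step materialises the full per-sample responsibility matrix, as scikit-learn's EM implementation does) shows why the fit exhausts memory:
# Assumption: float64 responsibilities of shape (n_samples, n_components)
n_samples, n_components = 2469082, 1024
resp_gib = n_samples * n_components * 8 / 1024**3
print(round(resp_gib, 1))  # ~18.8 GiB for a single E-step pass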
[Update]
Unfortunately the answers mentioned here do not solve my issue, since they assume the dataset has low variance, whereas in my case:
round(np.var(files_features), 3)
47.781
Incremental fitting might be a solution, but scikit-learn does not provide a partial_fit method for GMMs. I would appreciate any suggestions on how to tackle this, whether alternative libraries, partial_fit reference implementations, or a way to process the data batch by batch (which does not work naively, because GMM.fit() is memory-less and discards previous fits)?
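For reference, here is a rough sketch of the batch-by-batch idea using the newer GaussianMixture class with warm_start=True; warm_start only reuses the previous solution as the initialisation for the next fit() call, so this approximates incremental fitting rather than being a true partial_fit:
from sklearn.mixture import GaussianMixture

# Sketch only: each fit() call sees a single chunk, with the previous
# parameters reused as initialisation via warm_start. The chunk size is a
# placeholder, chosen so the (chunk_size, n_components) responsibility
# matrix fits in RAM.
gmm = GaussianMixture(n_components=1024, covariance_type='full',
                      max_iter=512, warm_start=True)
chunk_size = 200000
for start in range(0, files_features.shape[0], chunk_size):
    gmm.fit(files_features[start:start + chunk_size])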
Upvotes: 1
Views: 2265
Reputation: 2966
For those who run into the same issue, I recommend the Bob library, which supports big-data processing and even offers parallel processing.
In my use case, Bob was a great fit for developing GMM-UBM systems, as all the relevant functionality is already implemented.
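As a rough illustration only (the class and parameter names below follow the scikit-learn-style API of recent bob.learn.em releases and should be checked against the Bob documentation), training the UBM looks something like this:
from bob.learn.em import GMMMachine  # API assumed from recent bob.learn.em; verify against your installed version

# Sketch: fit a 1024-component UBM on the MFCC array.
ubm = GMMMachine(n_gaussians=1024)
ubm.fit(files_features)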
Upvotes: 1
Reputation: 2072
That's fairly straightforward using Dask.
Just use Dask's DataFrame instead of pandas', and everything else should work without any changes.
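For example, a minimal sketch of loading the feature table with Dask (the file name and block size are placeholders):
import dask.dataframe as dd

# Read the features in partitions instead of loading everything into RAM.
features_ddf = dd.read_csv('mfcc_features.csv', blocksize='64MB')  # placeholder path
# Operations are lazy and run partition by partition; .compute() materialises the result.
means = features_ddf.mean().compute()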
As an alternative to scikit-learn, you can use Turi's GraphLab Create, which can handle arbitrarily large datasets (though I'm not sure it supports GMMs).
Upvotes: 2