Reza

Reputation: 157

Using sklearn GridSearchCV to find optimized parameters for an SVM with large data (15750 samples)

I am trying to use GridSearchCV from sklearn in Python to find the parameters of an SVM classifier. The training data has shape (15750, 65536) (15750 samples, feature dimension: 65536).

Everything works fine with the default settings! However, if I enable parallel processing by setting n_jobs, I run into the following problem: the data is loaded into memory (on a machine with 48 GB RAM, it takes about 14% of the total), but the grid search/training never starts! In (h)top the process status is S (so it is essentially stopped). It keeps occupying the memory but never starts running (CPU usage stays at zero!).

I tried different values for n_jobs, such as 2, 3, 4, and 5 (the machine has 8 cores), with no luck. According to the documentation, with large data one can use the pre_dispatch option of GridSearchCV, so that the number of data copies is limited and memory problems are avoided. So I even tried n_jobs=2 with pre_dispatch=1, and still nothing works!

I should also mention that I tried the same code with a much smaller number of samples, e.g. 1000 of them, and again everything was fine. The question is: given that for one process the data takes only about 15% of the machine's memory, why can it not run on at least two cores with pre_dispatch=2? That should take around 30% of the machine's memory. Why is the process simply stopped, without even a memory error? And is there a way around this?
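To make the memory estimate above concrete, here is a rough back-of-the-envelope calculation (assuming the feature matrix is stored as 64-bit floats, NumPy's default dtype):

```python
import numpy as np

# Rough size of the training matrix: 15750 samples x 65536 features,
# stored as float64 (8 bytes per value).
n_samples, n_features = 15750, 65536
bytes_total = n_samples * n_features * np.dtype(np.float64).itemsize
gb = bytes_total / 1024.0 ** 3
print(round(gb, 1))  # ~7.7 GB, i.e. roughly 16% of 48 GB RAM
```

This matches the ~14-15% memory usage observed for a single process, so two full copies should indeed fit comfortably in 48 GB.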

Here is the piece of code doing the job (taken mainly from the sklearn documentation):

sklearn version: 0.12.1 and python version: 2.7.3

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# cv is a constructor argument, not a fit() argument
clf = GridSearchCV(SVC(C=1), tuned_parameters, n_jobs=2, verbose=3,
                   pre_dispatch=1, cv=3)
clf.fit(tr, tt)

Upvotes: 4

Views: 4067

Answers (1)

Moses Xu

Reputation: 2160

Have you tried n_jobs = -1, which instructs sklearn to use all CPUs? This setting worked perfectly for me (though I have a much smaller number of training samples).
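For reference, here is a minimal, self-contained sketch of that suggestion on a small synthetic dataset (the real data is far larger; the dataset and parameter values here are illustrative only, and the import path is sklearn.grid_search on 0.12-era versions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Small synthetic problem just to illustrate n_jobs=-1 (use all CPUs)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

params = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]},
          {'kernel': ['linear'], 'C': [1, 10]}]

clf = GridSearchCV(SVC(), params, n_jobs=-1, cv=3)
clf.fit(X, y)
print(clf.best_params_)
```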

Upvotes: 1
