mining

Reputation: 3699

How to scale large-scale data in scikit-learn?

The whole data set has 80 million samples, and each sample has 200 dense features. We often train a classifier with batch processing. For example, we use clf = sklearn.linear_model.SGDClassifier and then call clf.partial_fit(batch_data, batch_y) to fit the model batch by batch.
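For concreteness, a minimal sketch of that training loop might look like the following, assuming a hypothetical iter_batches() generator that yields (batch_data, batch_y) NumPy arrays:

```python
# Minimal sketch of batch training with partial_fit.
# iter_batches() is a hypothetical generator yielding (batch_data, batch_y).
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs all class labels up front

for batch_data, batch_y in iter_batches():
    clf.partial_fit(batch_data, batch_y, classes=classes)
```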

Before that, we should first scale batch_data. Suppose we use mean-std normalization. Then we need the global mean and standard deviation of each feature dimension, and we can use these global means and stds to scale batch_data.

Now the problem is how to obtain the mean and std of the whole data set. To compute the global std, we could use $\sigma^2 = E(X^2) - E(X)^2$, so we need to compute $E(X^2)$ and $E(X)$ by batch processing.
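A sketch of that batch-wise computation, again assuming the hypothetical iter_batches() generator, could accumulate running sums and sums of squares:

```python
# Sketch: accumulate per-batch sums, then apply sigma^2 = E(X^2) - E(X)^2.
import numpy as np

n_features = 200
count = 0
total = np.zeros(n_features)     # running sum of X
total_sq = np.zeros(n_features)  # running sum of X**2

for batch_data, _ in iter_batches():
    count += batch_data.shape[0]
    total += batch_data.sum(axis=0)
    total_sq += (batch_data ** 2).sum(axis=0)

global_mean = total / count
global_std = np.sqrt(total_sq / count - global_mean ** 2)
```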

I think Hadoop or Spark might be suitable for this task. For each batch of data, we could start an instance to compute the partial $E(X^2)$ and $E(X)$, then reduce them into the global ones.
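If Spark were used, that map-reduce could be sketched roughly as follows, assuming an RDD named rows_rdd whose elements are the 200-dimensional feature rows:

```python
# Rough PySpark sketch: each partition emits a partial (count, sum, sum of
# squares), and the partials are reduced into global statistics.
# rows_rdd is an assumed RDD of 200-dimensional feature rows.
import numpy as np

def partial_stats(rows):
    X = np.asarray(list(rows), dtype=np.float64)
    if X.size == 0:
        return iter([])  # skip empty partitions
    return iter([(X.shape[0], X.sum(axis=0), (X ** 2).sum(axis=0))])

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

count, total, total_sq = rows_rdd.mapPartitions(partial_stats).reduce(merge)
global_mean = total / count
global_std = np.sqrt(total_sq / count - global_mean ** 2)
```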

In scikit-learn, is there a more efficient way to scale a large data set? Maybe we could use multithreading or start multiple processes to handle the batches, then reduce the results to get the global means and stds.
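For the multi-process idea, a rough sketch with the standard library (assuming the batches are available as a list of NumPy arrays called batches) could be:

```python
# Sketch of the multi-process reduce: each worker returns partial statistics
# for one batch, and the partials are combined into global mean and std.
# "batches" is an assumed list of NumPy arrays of shape (batch_size, 200).
import numpy as np
from multiprocessing import Pool

def partial_stats(batch_data):
    return batch_data.shape[0], batch_data.sum(axis=0), (batch_data ** 2).sum(axis=0)

if __name__ == "__main__":
    with Pool() as pool:
        parts = pool.map(partial_stats, batches)
    count = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    global_mean = total / count
    global_std = np.sqrt(total_sq / count - global_mean ** 2)
```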

Upvotes: 2

Views: 1055

Answers (1)

Ranjan Kumar

Reputation: 123

You can utilize the n_jobs option available in most scikit-learn algorithms for parallel processing.
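For instance, many estimators accept n_jobs to spread work across CPU cores; a minimal illustration (estimator and parameters chosen arbitrarily):

```python
# Minimal illustration of n_jobs: -1 means use all available CPU cores.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# clf.fit(X, y)  # X, y assumed to be your training data
```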

For data of this size, I would recommend using Apache Spark.

Upvotes: 1
