Reputation: 3699
The whole data set has 80 million samples, and each sample has 200 dense features. We often train a classifier with batch processing. For example, if we adopt clf = sklearn.linear_model.SGDClassifier, we can call clf.partial_fit(batch_data, batch_y) to fit the model on each batch.

Before that, we should first scale batch_data. Suppose we use mean-std normalization: we need to obtain the global mean and standard deviation of each feature dimension, and then we can use these global statistics to scale batch_data.
Now the problem is how to obtain the mean and std of the whole data set. To compute the global std, we could use $\sigma^2 = E(X^2) - E(X)^2$, so we need to compute $E(X^2)$ and $E(X)$ by batch processing.
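A hedged sketch of that first pass (not from the original post): accumulate the per-batch sums of x and x**2, then derive the global mean and std from $E(X^2) - E(X)^2$. iter_batches() is the same hypothetical generator as above.

```python
import numpy as np

n_features = 200
count = 0
sum_x = np.zeros(n_features)
sum_x2 = np.zeros(n_features)

# first pass: accumulate sufficient statistics batch by batch
for batch_data, _ in iter_batches():
    count += batch_data.shape[0]
    sum_x += batch_data.sum(axis=0)
    sum_x2 += (batch_data ** 2).sum(axis=0)

global_mean = sum_x / count                              # E(X)
global_std = np.sqrt(sum_x2 / count - global_mean ** 2)  # sqrt(E(X^2) - E(X)^2)
```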
I think Hadoop or Spark might be suitable for this task: for each batch of data, we could start an instance to compute the partial $E(X^2)$ and $E(X)$, and then reduce them into the global values.
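As an illustration of that map-reduce idea, here is a rough PySpark sketch; the data source (a pickle file of NumPy feature rows at a placeholder HDFS path) is an assumption, not part of the original question:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-stats").getOrCreate()
sc = spark.sparkContext

# Hypothetical source: an RDD whose elements are 1-D NumPy arrays of 200 features.
rows_rdd = sc.pickleFile("hdfs:///path/to/feature_rows")

def partial_stats(rows):
    """Map one partition to a single (count, sum_x, sum_x2) triple."""
    rows = list(rows)
    if rows:
        x = np.vstack(rows)
        yield (x.shape[0], x.sum(axis=0), (x ** 2).sum(axis=0))

# reduce the per-partition triples into global sums
count, sum_x, sum_x2 = rows_rdd.mapPartitions(partial_stats).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2])
)

global_mean = sum_x / count
global_std = np.sqrt(sum_x2 / count - global_mean ** 2)
```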
In scikit-learn, is there a more efficient way to scale a large data set? Maybe we could use multithreading or start multiple processes to handle the batches, and then reduce the results to get the global means and stds.
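A minimal multiprocessing sketch of that idea; the batch count and file paths are made-up placeholders for however the batches are actually stored:

```python
import numpy as np
from multiprocessing import Pool

NUM_BATCHES = 800  # assumption: the 80M samples are split into 800 batches on disk

def batch_stats(i):
    """Load one batch (hypothetical .npy files) and return its partial statistics."""
    x = np.load(f"batches/batch_{i}.npy")  # placeholder path
    return x.shape[0], x.sum(axis=0), (x ** 2).sum(axis=0)

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        partials = pool.map(batch_stats, range(NUM_BATCHES))

    # reduce the partial results into global statistics
    count = sum(p[0] for p in partials)
    sum_x = sum(p[1] for p in partials)
    sum_x2 = sum(p[2] for p in partials)

    global_mean = sum_x / count
    global_std = np.sqrt(sum_x2 / count - global_mean ** 2)
```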
Upvotes: 2
Views: 1055
Reputation: 123
You can utilize the n_jobs option available in most scikit-learn algorithms for parallel processing.
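For example (note that in SGDClassifier the n_jobs parameter only parallelizes the one-vs-all fits for multiclass problems):

```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(n_jobs=-1)  # -1 uses all available CPU cores
```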
For data of this size, I would recommend using Apache Spark.
Upvotes: 1