Reputation: 61
I am solving a multilabel classification problem. I have about 6 million rows to process, each a large chunk of text, tagged with multiple tags in a separate column.
Any advice on which scikit-learn tools could help me scale up my code? I am using One-vs-Rest with an SVM inside it, but it doesn't scale beyond 90-100k rows.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
Upvotes: 6
Views: 2490
Reputation: 89
HashingVectorizer will work if you iteratively chunk your data into batches of, say, 10k or 100k documents that fit in memory.
You can then pass each batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and iterate over new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting until it has seen all the samples.
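To make that concrete, here is a minimal sketch of such an out-of-core loop. The iter_batches generator and its toy binary labels are placeholders for your own data reading code; in your multilabel setting you would train one such binary classifier per tag (which is what OneVsRestClassifier does), and n_features, the batch size and the loss are arbitrary choices:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless hashing vectorizer: no vocabulary to fit, so it works batch by batch.
vectorizer = HashingVectorizer(n_features=2**20)

# Linear classifier trained incrementally via partial_fit.
clf = SGDClassifier(loss='hinge')
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

def iter_batches(n_batches=5, batch_size=10000):
    """Stand-in for reading ~10k real documents at a time from disk."""
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        flags = rng.randint(0, 2, batch_size)
        texts = ['spam offer win money' if f else 'meeting agenda notes' for f in flags]
        yield texts, flags.tolist()

batches = iter_batches()
val_texts, y_val = next(batches)        # hold out the first batch for validation
X_val = vectorizer.transform(val_texts)

for texts, labels in batches:
    X = vectorizer.transform(texts)     # hash the raw text of this batch
    clf.partial_fit(X, labels, classes=all_classes)
    print('validation accuracy: %.3f' % clf.score(X_val, y_val))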
You can also do this in parallel on several machines, on partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
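For the averaging step, a rough sketch, assuming models is a list of SGDClassifier instances trained on different partitions with the same classes and the same hashing feature space:

import numpy as np
from sklearn.linear_model import SGDClassifier

def average_models(models):
    """Build a single linear model whose weights are the mean of the partial models."""
    averaged = SGDClassifier(loss='hinge')
    averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
    averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    averaged.classes_ = models[0].classes_
    return averaged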
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel: https://github.com/ogrisel/parallel_ml_tutorial
Upvotes: 0
Reputation: 2960
SVMs scale well as the number of columns increases, but poorly as the number of rows increases, since they essentially learn which rows constitute the support vectors. This is a common complaint about SVMs, but most people don't understand why, as they typically scale well enough on most reasonable datasets.
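One way to see this empirically is to time a kernel SVC on growing subsets of a synthetic dataset (the sizes and parameters below are arbitrary); training time grows much faster than linearly with the number of rows:

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy benchmark: fit a kernel SVM on subsets of increasing size.
X, y = make_classification(n_samples=20000, n_features=100, random_state=0)

for n in (2000, 4000, 8000, 16000):
    start = time.time()
    SVC(kernel='rbf').fit(X[:n], y[:n])
    print('%6d rows: %.1f s' % (n, time.time() - start))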
Upvotes: 3