Reputation: 6081
I'm currently experimenting with an ML task that involves supervised training of a classification model. So far I have ~5M training examples and ~5M examples for cross-validation. Each example currently has 46 features; however, I might want to generate 10 more in the near future, so any solution should leave some headroom.
My problem is the following: which tool should I use to tackle this? I'd like to use random forests or an SVM, but I'm afraid the latter might be too slow in my case. I've considered Mahout, but turned away from it, as it appears to require a fair amount of configuration and messing around with command-line scripts. I'd rather code directly against some (well-documented!) library or define my model with a GUI.
I should also specify that I'm looking for something that will run on Windows (without tools such as Cygwin), and solutions that play well with .NET are much appreciated.
You can imagine that, when the time comes, the code will be run on a Cluster Compute Eight Extra Large instance on Amazon EC2, so anything that makes heavy use of RAM and multi-core CPUs is welcome.
Last but not least, I should specify that my dataset is dense (there are no missing values; every column has a value in each vector).
Upvotes: 6
Views: 1726
Reputation: 12142
I would recommend looking at stochastic gradient descent for a problem of this scale. A good tool to look at is Vowpal Wabbit. At that size you can probably run your experiments on a desktop with reasonable specs. The only downside for you, I think, is that it is not Windows-centric; although I haven't checked, it should run under Cygwin.
EDIT: There has been great interest from the developers in getting Vowpal Wabbit running on Windows. As of March 2013, Vowpal Wabbit (version 7.2) runs on Windows out of the box. A couple of advanced/optional features are not yet implemented on Windows, one of them being running Vowpal Wabbit as a daemon, but it seems that will be handled in the near future.
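If you want to see the idea in library form before committing to VW, here is a minimal out-of-core SGD sketch. It uses Python with scikit-learn purely as an illustration (my substitution, not something this thread discusses; the thread only mentions VW, R, and .NET-friendly tools), and the data source is a synthetic stand-in.

```python
# Minimal out-of-core SGD sketch, analogous in spirit to what VW does.
# scikit-learn and the synthetic data source are assumptions made for
# illustration only.
import numpy as np
from sklearn.linear_model import SGDClassifier

N_FEATURES = 46              # current feature count from the question
CHUNK = 100_000              # examples per mini-batch; tune to available RAM
classes = np.array([0, 1])   # assumed binary labels

# loss="hinge" trains a linear SVM by SGD -- one of the methods the
# asker considered, at a cost that scales linearly with example count.
clf = SGDClassifier(loss="hinge", alpha=1e-4)

def batches(n_total, chunk, seed=0):
    """Stand-in data source; replace with reads from the real dataset."""
    rng = np.random.default_rng(seed)
    for start in range(0, n_total, chunk):
        n = min(chunk, n_total - start)
        X = rng.normal(size=(n, N_FEATURES))
        y = rng.integers(0, 2, size=n)
        yield X, y

for X, y in batches(1_000_000, CHUNK):
    clf.partial_fit(X, y, classes=classes)  # one SGD update pass per chunk
```

Because each chunk is discarded after its `partial_fit` call, memory use stays bounded by the chunk size rather than the full 5M-example dataset.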
Upvotes: 2
Reputation: 246
I routinely run datasets with similar row/feature counts in R on EC2 (the 16-core / 60 GB instance type you are referring to is particularly useful if you are using a method that can take advantage of multiple CPUs, such as the caret package). As you've mentioned, though, not all learning methods (such as SVM) are going to perform well on such a dataset.
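As a hedged illustration of that multi-CPU point: this answer refers to R's caret, but the same idea of spreading model fitting across all cores looks like this in a Python/scikit-learn sketch (my substitution for illustration):

```python
# Hedged Python analogue of the multi-CPU point above; the answer itself
# uses R's caret, so scikit-learn here is a substitution for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for the dense 46-feature dataset.
X, y = make_classification(n_samples=20_000, n_features=46, random_state=0)

# n_jobs=-1 builds trees on all available cores, e.g. the 16 cores of
# the EC2 instance type mentioned in the question.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
```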
You may want to consider using a 10% sample or so for quick prototyping / performance benchmarking before switching to running on the full dataset.
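A minimal sketch of that subsampling step, assuming the data sits in NumPy arrays (file-backed data can be sampled the same way by row index); X and y here are synthetic placeholders:

```python
# Draw a ~10% random sample of the training rows for quick prototyping.
# X and y are synthetic placeholders standing in for the real 5M x 46 set.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50_000, 46))
y = rng.integers(0, 2, size=50_000)

idx = rng.choice(X.shape[0], size=X.shape[0] // 10, replace=False)
X_sample, y_sample = X[idx], y[idx]   # ~10% subsample for benchmarking
```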
If you want extremely high performance, then Vowpal Wabbit is a better fit (but it only supports generalized linear learners, so no gbm or random forests). Besides, VW is not very Windows-friendly.
Upvotes: 3