machine learning - predicting one instance at a time - lots of instances - trying not to use I/O

Question

I have a large dataset and I'm trying to build a DAgger classifier for it. As you know, at training time I need to run the initially learned classifier on the training instances (predict them), one instance at a time.

Libsvm is too slow even for the initial learning.

I'm using OLL, but it requires writing each instance to a file, running the test code on that file, and reading back the prediction, which involves a lot of disk I/O.

I have considered working with vowpal_wabbit (though I'm not sure whether it would help with the disk I/O), but I don't have permission to install it on the cluster I'm working on.

Liblinear is too slow and, I believe, also needs disk I/O. What other alternatives can I use?

Martin Popel · Accepted Answer

I recommend trying Vowpal Wabbit (VW). If Boost (and gcc or clang) is installed on the cluster, you can simply compile VW yourself (see the Tutorial). If Boost is not installed, you can compile Boost yourself as well.

VW contains more modern algorithms than OLL. Moreover, it contains several structured prediction algorithms (SEARN, DAgger) and provides C++ and Python interfaces. See an iPython notebook tutorial.

As for the disk I/O: for one-pass learning, you can pipe the input data directly to vw (cat data | vw) or run vw --daemon. For multi-pass learning, you must use a cache file (the input data in a binary, fast-to-load format). Creating it takes some time (it is written during the first pass, unless it already exists), but the subsequent passes are much faster because of the binary format.
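To illustrate the vw --daemon route, here is a minimal Python sketch that sends instances to a running daemon over TCP and reads one prediction back per instance, so nothing is written to disk. It assumes a daemon was started separately in test-only mode, for example with vw --daemon --port 26542 -i model.vw -t, where model.vw is a placeholder for your trained model. The helper names to_vw_line and predict_stream are made up for this example and are not part of VW.

```python
import socket


def to_vw_line(features, label=None):
    """Format one instance in VW's text input format: '[label] | name:value ...'.

    Feature names must not contain spaces, ':' or '|'.
    """
    feats = " ".join(f"{name}:{value}" for name, value in features.items())
    prefix = "" if label is None else f"{label} "
    return f"{prefix}| {feats}"


def predict_stream(lines, host="localhost", port=26542):
    """Yield the daemon's raw reply for each example line, one at a time.

    Assumes a `vw --daemon` process is already listening on host:port.
    A single connection is reused for all instances.
    """
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("r")
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))
            yield reader.readline().strip()


if __name__ == "__main__":
    # Hypothetical instances; requires a running daemon to produce output.
    instances = [{"height": 1.5, "width": 2}, {"height": 0.3, "width": 7}]
    for reply in predict_stream(to_vw_line(x) for x in instances):
        print(reply)
```

Because predict_stream is a generator, each instance is sent only when the next prediction is requested, which fits the one-instance-at-a-time pattern described in the question.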
