Reputation: 2711
In a paper titled "Machine Learning at the Limit," Canny et al. report substantial word2vec processing speed improvements.
I'm working with the BIDMach library used in this paper, and cannot find any resource that explains how Word2Vec is implemented or how it should be used within this framework.
There are several scripts in the repo.
I've tried running them (after building the referenced tparse2.exe file) with no success. I've tried modifying them to get them to run, but got nothing back except errors.
I emailed the author and posted an issue on the GitHub repo, but have heard nothing back. The only response was from somebody else having the same trouble, who says he got it to run, but at much slower speeds than reported, on newer GPU hardware.
I've searched all over trying to find anyone who has used this library to achieve these speeds, with no luck. There are multiple references floating around that point to this library as the fastest implementation out there and cite the numbers in the paper.
When I search for a similar library (gensim) and the import code required to run it, I find thousands of results and tutorials, but a similar search for the BIDMach code yields only the BIDMach repo.
This BIDMach implementation certainly carries the reputation for being the best, but can anyone out there tell me how to use it?
All I want to do is run a simple training process to compare it to a handful of other implementations on my own hardware.
Every other implementation of this concept I can find either works with the original shell-script test file, provides actual instructions, or provides shell scripts of its own to test.
UPDATE: The author of the library has added additional shell scripts to get the previously mentioned scripts running, but exactly what they mean or how they work is still a total mystery, and I can't figure out how to get the word2vec training procedure to run on my own data.
EDIT (for bounty)
I'll give the bounty to anyone who can explain how I'd use my own corpus (text8 would be great), train a model, and then save the output vectors and the vocabulary to files that can be read by Omar Levy's Hyperwords.
This is exactly what the original C implementation would do with arguments -binary 1 -output vectors.bin -save-vocab vocab.txt
This is also what Intel's implementation does, and other CUDA implementations, etc, so this is a great way to generate something that can be easily compared with other versions...
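For reference, the two files the original C implementation produces are simple enough to reproduce by hand, which is useful for checking whether another implementation's output will be readable downstream. Below is a minimal Python sketch of both formats; the words, counts, and vector values are made-up placeholders, not output from any real training run:

```python
import struct

# Hypothetical toy data standing in for a trained model's output.
vocab = {"the": 1061396, "of": 593677, "and": 416629}  # word -> corpus count
dim = 4
vectors = {w: [0.1 * i + j for j in range(dim)] for i, w in enumerate(vocab)}

# vectors.bin: what word2vec.c writes with -binary 1 -- an ASCII header
# "vocab_size dim\n", then for each word: the token, a space, dim float32
# values (native byte order, little-endian on x86), and a newline.
with open("vectors.bin", "wb") as f:
    f.write(f"{len(vocab)} {dim}\n".encode("utf-8"))
    for word in vocab:
        f.write(word.encode("utf-8") + b" ")
        f.write(struct.pack(f"<{dim}f", *vectors[word]))
        f.write(b"\n")

# vocab.txt: what -save-vocab writes -- one "word count" pair per line,
# sorted by count, descending.
with open("vocab.txt", "w", encoding="utf-8") as f:
    for word, count in sorted(vocab.items(), key=lambda kv: -kv[1]):
        f.write(f"{word} {count}\n")
```

Any implementation whose output can be dumped into these two layouts (gensim can do it directly via `save_word2vec_format(..., binary=True)`) should be loadable for the same side-by-side comparison.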
UPDATE (bounty expired without answer)
John Canny has updated a few scripts in the repo and added a fmt.txt file, thus making it possible to run the test scripts that are packaged in the repo.
However, my attempt to run this with the text8 corpus yields near 0% accuracy on the hyperwords test.
Running the training process on the billion word benchmark (which is what the repo scripts now do) also yields well-below-average accuracy on the hyperwords test.
So either the library never achieved reasonable accuracy on these tests, or I'm still missing something in my setup.
The issue remains open on github.
Upvotes: 15
Views: 452
Reputation: 11
BIDMach's Word2vec is a tool for learning vector representations of words, also known as word embeddings. To use Word2vec in BIDMach, you will first need to download and install BIDMach, which is an open-source machine learning library written in Scala. Once you have BIDMach installed, you can use the word2vec function to train a Word2vec model on a corpus of text data. This function takes a number of parameters, such as the size of the word vectors, the number of epochs to train for, and the type of model to use. You can find more detailed instructions and examples of how to use the word2vec function in the BIDMach documentation.
Upvotes: 1