Reputation: 2711
In a paper titled "Machine Learning at the Limit," Canny et al. report substantial word2vec processing speed improvements.
I'm working with the BIDMach library used in this paper, and cannot find any resource that explains how Word2Vec is implemented or how it should be used within this framework.
There are several scripts in the repo.
I've tried running them (after building the referenced tparse2.exe file) with no success. I've tried modifying them to get them to run, but got nothing back except errors.
I emailed the author and posted an issue on the GitHub repo, but have heard nothing back. The only response was from somebody else having the same trouble, who says he got it to run, but at much slower speeds than reported, on newer GPU hardware.
I've searched all over trying to find anyone who has used this library to achieve these speeds, with no luck. There are multiple references floating around that point to this library as the fastest implementation out there and cite the numbers in the paper.
When I search for a similar library (gensim) and the import code required to run it, I find thousands of results and tutorials, but a similar search for the BIDMach code yields only the BIDMach repo.
This BIDMach implementation certainly carries the reputation for being the best, but can anyone out there tell me how to use it?
All I want to do is run a simple training process to compare it to a handful of other implementations on my own hardware.
Every other implementation of this concept I can find either works with the original shell-script test file, provides actual instructions, or provides shell scripts of its own to test.
UPDATE: The author of the library has added additional shell scripts to get the previously mentioned scripts running, but exactly what they mean or how they work is still a total mystery, and I can't figure out how to get the word2vec training procedure to run on my own data.
EDIT (for bounty)
I'll give the bounty to anyone who can explain how I'd use my own corpus (text8 would be great), train a model, and then save the output vectors and the vocabulary to files that can be read by Omar Levy's Hyperwords.
This is exactly what the original C implementation would do with arguments -binary 1 -output vectors.bin -save-vocab vocab.txt
This is also what Intel's implementation does, and other CUDA implementations, etc, so this is a great way to generate something that can be easily compared with other versions...
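For reference, the two files the original C implementation produces are simple enough to reproduce by hand, which is useful for checking whether another implementation's output will be readable downstream. Below is a minimal Python sketch of both formats; the words, counts, and vector values are made-up placeholders, not output from any real training run:

```python
import struct

# Hypothetical toy data standing in for a trained model's output.
vocab = {"the": 1061396, "of": 593677, "and": 416629}  # word -> corpus count
dim = 4
vectors = {w: [0.1 * i + j for j in range(dim)] for i, w in enumerate(vocab)}

# vectors.bin: what word2vec.c writes with -binary 1 -- an ASCII header
# "vocab_size dim\n", then for each word: the token, a space, dim float32
# values (native byte order, little-endian on x86), and a newline.
with open("vectors.bin", "wb") as f:
    f.write(f"{len(vocab)} {dim}\n".encode("utf-8"))
    for word in vocab:
        f.write(word.encode("utf-8") + b" ")
        f.write(struct.pack(f"<{dim}f", *vectors[word]))
        f.write(b"\n")

# vocab.txt: what -save-vocab writes -- one "word count" pair per line,
# sorted by count, descending.
with open("vocab.txt", "w", encoding="utf-8") as f:
    for word, count in sorted(vocab.items(), key=lambda kv: -kv[1]):
        f.write(f"{word} {count}\n")
```

Any implementation whose output can be dumped into these two layouts (gensim can do it directly via `save_word2vec_format(..., binary=True)`) should be loadable for the same side-by-side comparison.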
UPDATE (bounty expired without answer)
John Canny has updated a few scripts in the repo and added a fmt.txt file, thus making it possible to run the test scripts that are packaged in the repo.
However, my attempt to run this with the text8 corpus yields near 0% accuracy on the hyperwords test.
Running the training process on the billion word benchmark (which is what the repo scripts now do) also yields well-below-average accuracy on the hyperwords test.
So either the library never achieved reasonable accuracy on these tests, or I'm still missing something in my setup.
The issue remains open on github.
Upvotes: 15
Views: 452
Reputation: 11
BIDMach's Word2vec is a tool for learning vector representations of words, also known as word embeddings. To use Word2vec in BIDMach, you will first need to download and install BIDMach, which is an open-source machine learning library written in Scala. Once you have BIDMach installed, you can use the word2vec function to train a Word2vec model on a corpus of text data. This function takes a number of parameters, such as the size of the word vectors, the number of epochs to train for, and the type of model to use. You can find more detailed instructions and examples of how to use the word2vec function in the BIDMach documentation.
Upvotes: 1