samsamara
samsamara

Reputation: 4750

applying word2vec on small text files

I'm totally new to word2vec so please bear it with me. I have a set of text files each containing a set of tweets, between 1000-3000. I have chosen a common keyword ("kw1") and I want to find semantically relevant terms for "kw1" using word2vec. For example if the keyword is "apple" I would expect to see related terms such as "ipad" "os" "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file as word2vec would be trained on individual files (eg., 5 input files, run word2vec 5 times on each file).

My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.

My questions/doubts are:

I have downloaded the code from code.google.com: https://code.google.com/p/word2vec/ and have just given it a dry run as follows:

 time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50

./distance vectors.bin 

Upvotes: 4

Views: 1124

Answers (1)

ozborn
ozborn

Reputation: 1070

First Question:

Yes, for almost any task that I can imagine word2vec being applied to you are going to have to clean the data - especially if you are interested in semantics (not syntax) which is the usual reason to run word2vec. Also, it is not just about removing stopwords although that is a good first step. Typically you are going to want to have a tokenizer and sentence segmenter as well, I think if you look at the document for deeplearning4java (which has a word2vec implementation) it shows using these tools. This is important since you probably don't care about the relationship between apple and the number "5", apple and "'s", etc... For more discussion on preprocessing for word2vec see https://groups.google.com/forum/#!topic/word2vec-toolkit/TI-TQC-b53w

Second Question:

There is no automatic tuning available for word2vec AFAIK, since that implys the author of the implementation knows what you plan to do with it. Typically default values for the implementation are the "best" values for whoever implemented on a (or a set of) tasks. Sorry, word2vec isn't a turn-key solution. You will need to understand the parameters and adjust them to fix your task accordingly.

Upvotes: 4

Related Questions