Ailef

Reputation: 7906

Given 5 input words, predict the "most associated" word

I have to solve this task for an NLP homework. The task is exactly as general as the title describes. A set of 2000 examples with the corresponding expected output is provided; they look like:

absence ~ away fonder illness leave presence
absent ~ away minded gone present ill
absurdity ~ stupid ridiculous mad stupidity clown
accents ~ dialects language foreign speech French
accordion ~ music piano play player instrument

I have already solved the task using distributional semantics with decent accuracy on this set. The problem is an additional constraint: the archive I deliver must be smaller than 50 MB (as far as I'm concerned this constraint is complete nonsense, but I still have to comply). A straightforward distributional-semantics approach therefore won't work: the semantic space has to be built from a lot of data (thousands of Wikipedia pages, in my case), and I can't shrink it enough to fit into 50 MB.
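For reference, my prediction step is essentially "average the cue vectors, return the nearest non-cue word". Here is a toy sketch of that logic with made-up 3-dimensional vectors (the real vectors come from the Wikipedia-based semantic space, and the vocabulary is of course much larger):

```python
import numpy as np

# Toy stand-in embedding table; in practice these vectors are learned
# from corpus co-occurrence statistics, not hand-written.
vectors = {
    "music":      np.array([0.9, 0.1, 0.0]),
    "piano":      np.array([0.8, 0.2, 0.1]),
    "play":       np.array([0.6, 0.4, 0.2]),
    "player":     np.array([0.5, 0.5, 0.2]),
    "instrument": np.array([0.9, 0.2, 0.1]),
    "accordion":  np.array([0.8, 0.1, 0.1]),
    "illness":    np.array([0.0, 0.9, 0.3]),
}

def predict(cues, vectors):
    """Return the vocabulary word closest (by cosine) to the mean cue vector."""
    mean = np.mean([vectors[w] for w in cues], axis=0)
    mean /= np.linalg.norm(mean)
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in cues:
            continue  # never predict one of the cue words themselves
        sim = vec @ mean / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(predict(["music", "piano", "play", "player", "instrument"], vectors))
# → accordion
```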

Can you suggest any other approaches that I can use to tackle this problem?

Upvotes: 0

Views: 100

Answers (1)

mbatchkarov

Reputation: 16049

This happens often in the scientific literature when data has to be shared. Usually one submits the resource (word vectors, in your case), plus the code used to build it and a link to the raw data (e.g. Wikipedia). You should also distribute any other code needed to use the resource (e.g. code to query the model for the words most associated with a given target).

In your case, provided you use sensible dimensionality reduction, you should be able to fit a decent-coverage distributional model in 50 MB. The models I am working with right now take about 150 MB to store 70k word vectors as uncompressed plain text (plus there is a lot of overhead due to the specific format I am using); zipping brings that down to 37 MB.
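On top of dimensionality reduction and zipping, lowering the numeric precision of the stored vectors is another easy win, and it rarely changes similarity rankings noticeably. A sketch of the size arithmetic with NumPy, using a random stand-in for a hypothetical 70k × 300 embedding matrix:

```python
import numpy as np

# Random stand-in for a real embedding matrix: 70k words x 300 dims.
rng = np.random.default_rng(0)
emb = rng.standard_normal((70_000, 300)).astype(np.float32)
print(emb.nbytes / 1e6)  # 84.0 MB in float32 — over the 50 MB budget

# Half precision: 42 MB, already under budget before any zipping.
emb16 = emb.astype(np.float16)
print(emb16.nbytes / 1e6)  # 42.0 MB

# 8-bit linear quantization (one scale factor stored separately): 21 MB.
scale = np.abs(emb).max() / 127.0
emb8 = np.round(emb / scale).astype(np.int8)
print(emb8.nbytes / 1e6)  # 21.0 MB
```

Storing the matrix in a binary format (rather than plain text) also removes the parsing overhead and typically compresses better.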

Upvotes: 1
