antonpp

Reputation: 2373

Information retrieval - looking for term synonyms

This is quite a broad question, and I am not looking for a concrete implementation (although if something that solves this problem already exists, that would be awesome). If anyone can give me an idea of how the requested information can be retrieved, that would be perfect.

Let me describe the problem with an example. I have the name of a university (Oxford University, for instance), and I am going to filter Twitter to find tweets that mention this university. Obviously, most of them would not contain the exact words "Oxford University"; instead, something like "Oxon", "Oxf" or just "Oxford" might be used.

My question is how one can automatically find all synonyms for a term (more precisely, I am only interested in university names).

Upvotes: 3

Views: 304

Answers (3)

borowis

Reputation: 1235

Depending on the language and platform you use, there are NER (named entity recognition) extractors available. For Java, for example, there is a library from Stanford that you could use, so there is no need to write your own. Please also see this answer for Java; it lists even more useful tools.

After running the tool, you could browse the different categories to identify related entities visually first (like Oxford, Oxf, MIT, etc.), and then you would probably need to do some post-processing, such as stemming or word clustering with word2vec.
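As a rough illustration of the post-processing step, here is a minimal sketch that groups candidate strings (assumed to have already been extracted by an NER tool) under a known university name. It uses two simple heuristics, acronym matching and word-prefix matching, in place of stemming or word2vec clustering; the function names `candidate_matches` and `group_candidates` and the `min_prefix` threshold are hypothetical, chosen only for this example.

```python
def candidate_matches(name, candidate, min_prefix=3):
    """Heuristic check: is `candidate` an acronym of `name`,
    or a prefix (at least `min_prefix` chars) of one of its words?"""
    words = name.lower().split()
    cand = candidate.lower().strip(".")
    acronym = "".join(w[0] for w in words)
    if cand == acronym:
        return True
    return len(cand) >= min_prefix and any(w.startswith(cand) for w in words)


def group_candidates(name, candidates):
    """Keep only the candidates that plausibly refer to `name`."""
    return [c for c in candidates if candidate_matches(name, c)]


# Candidate strings assumed to come from an NER pass over tweets.
found = group_candidates(
    "Oxford University", ["Oxf", "Oxford", "OU", "MIT", "Oxon"]
)
print(found)
```

Note that a form like "Oxon" is not caught by either heuristic, since it is neither an acronym nor a prefix. Aliases of that kind would still need a manually curated list or a similarity measure learned from data.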

Upvotes: 0

Alikbar

Reputation: 696

These kinds of problems don't have simple, straightforward solutions, but you can implement this paper: Named Entity Recognition from Tweets

And if you want to read more about this problem, search for named entity recognition (NER).

Upvotes: 3

Lockless

Reputation: 497

Typically the answer to this would be to use word stemming. The trouble is that you are not dealing with dictionary words. University names typically have a large number of abbreviations that do not follow a convention. The next logical step would be to use a regex, but Twitter does not support regex for searching; all information must be searched for generally and then post-processed.

So your best bet is to use a combination of query operators to narrow your search as much as possible (https://dev.twitter.com/rest/public/search), then post-process the results on your server side. Although this is an inelegant answer with a lot of manual work, I don't see another approach.
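To make the two-stage idea concrete, here is a minimal sketch, assuming you already have a hand-built alias list. It builds a broad search string by joining the aliases with the `OR` operator (quoting multi-word phrases), then applies a stricter regex on your own side to the returned tweet text. The helper names `build_query` and `make_filter` are hypothetical, and the actual API call is left out.

```python
import re


def build_query(aliases):
    """Join aliases with OR; quote multi-word aliases as exact phrases."""
    parts = ['"%s"' % a if " " in a else a for a in aliases]
    return " OR ".join(parts)


def make_filter(aliases):
    """Return a predicate that checks tweet text for any alias
    as a whole word, ignoring case."""
    # Longest aliases first, so longer phrases are tried before shorter ones.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(a) for a in ordered),
        re.IGNORECASE,
    )
    return lambda text: bool(pattern.search(text))


# Manually curated alias list (an assumption for this example).
ALIASES = ["Oxford University", "Oxford", "Oxon", "Oxf"]

query = build_query(ALIASES)    # string to send to the search endpoint
mentions = make_filter(ALIASES)  # server-side post-processing step

print(query)
print(mentions("Visiting Oxon today"))
print(mentions("Donated to oxfam"))
```

The word-boundary check is what lets the server-side step reject near misses such as "oxfam", which a plain substring search for "Oxf" would wrongly accept.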

Upvotes: 1
