Reputation: 2362

Words Prediction - Get most frequent predecessor and successor

Given a word, I want to get the list of the most frequent predecessors and successors of that word in English. I have written code that does bigram analysis on any corpus (I used the Enron email corpus) and can predict the most frequent next word, but I want another solution because a) I want to check the accuracy of my predictions, and b) corpus- or dataset-based solutions fail for an unseen word.

For example, given the word "excellent", I want to get the words that are most likely to come before and after "excellent".

My question is whether any service or API exists for this purpose.

Upvotes: 2

Answers (3)

Josh S

Reputation: 147

I just re-read the original question and realized that the answers, mine included, got off base. I think the asker just wanted to solve a simple programming problem, not look for datasets.

If you list all distinct word-pairs and count them, then you can answer your question with simple math on that list.

Of course, you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as many as 30,000 there are close to a billion possible pairs (30,000 × 30,000 = 900 million), I doubt that in practice that many actually occur. So you can probably write a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that periodically flushes out the less important ones while scanning. You can also segment the word list: generate the pairs of the first hundred words versus the rest, then the next hundred, and so on, and calculate in passes.
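A minimal sketch of the counting approach described above, assuming the corpus has already been tokenized into a list of words. The function names `build_neighbor_counts` and `most_frequent` are made up for illustration, and the periodic flushing and multi-pass ideas are left out:

```python
from collections import Counter, defaultdict


def build_neighbor_counts(tokens):
    """Count every adjacent word pair in both directions.

    successors[w] counts the words seen right after w,
    predecessors[w] counts the words seen right before w.
    """
    successors = defaultdict(Counter)
    predecessors = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        successors[prev_word][next_word] += 1
        predecessors[next_word][prev_word] += 1
    return predecessors, successors


def most_frequent(table, word, n=3):
    """Return up to n neighbors of word, most frequent first.

    A word that never occurred in the corpus yields an empty list.
    """
    return [w for w, _ in table.get(word, Counter()).most_common(n)]


tokens = "an excellent idea and an excellent plan and an excellent idea".split()
predecessors, successors = build_neighbor_counts(tokens)
print(most_frequent(predecessors, "excellent"))  # words seen before "excellent"
print(most_frequent(successors, "excellent"))    # words seen after "excellent"
```

Keeping two tables costs twice the memory of a single pair counter, but both lookups become a single dictionary access.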

My original answer is below. I'm leaving it because it relates to a question of my own:

I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).

I found a download page for Google's ngram files, but they're not that good; they're full of scanning errors. 'i's become '1's, words run together, etc. Hopefully Google has improved their scanning technology since then.

The just-download-Wikipedia-unpack-it-and-strip-the-XML idea is a bust for me; I don't have a fast computer (heh, I have a choice between an Atom netbook and an Android device). Imagine how long it would take me to unpack a 3-gigabyte bz2 file that becomes what, 100 gigabytes of XML? Then I'd have to process it with Beautiful Soup and filters that he admits crash partway through each file and need to be restarted.

For your purpose (previous and following words) you could create a dictionary of real words and filter the ngram lists to exclude the mis-scanned words. One might hope that the scanning was good enough that you could exclude mis-scans by taking only the most popular words, but I saw some signs of consistent mistakes.
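The dictionary filter could look roughly like this. It is only a sketch: it assumes the n-gram counts have already been loaded into a dict keyed by word tuples (not the actual layout of the Google files), and `filter_ngrams` is a hypothetical helper name:

```python
def filter_ngrams(ngram_counts, vocabulary):
    """Keep only the n-grams in which every word is a known dictionary word.

    ngram_counts: dict mapping a tuple of words to its count (assumed layout).
    vocabulary: iterable of real words, e.g. loaded from a word list.
    """
    vocab = {word.lower() for word in vocabulary}
    return {
        gram: count
        for gram, count in ngram_counts.items()
        if all(word.lower() in vocab for word in gram)
    }


counts = {
    ("excellent", "idea"): 10,
    ("exce11ent", "idea"): 2,      # 'l' mis-scanned as '1'
    ("an", "excellentidea"): 1,    # two words run together
}
clean = filter_ngrams(counts, ["an", "excellent", "idea"])
print(clean)
```

This drops mis-scans that produce non-words, but not the ones that happen to produce a different real word.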

The ngram datasets are here by the way http://books.google.com/ngrams/datasets

This site may have what you want http://www.wordfrequency.info/

Upvotes: 2

frazman

Reputation: 33243

You have to give the algorithm some more instances or context for the "unseen" word so that it can make an inference. One indirect way is to read the rest of the words in the sentence and look up, in a dictionary, the words that occur alongside them. In general, you can't expect the algorithm to learn and understand a word the first time it sees it. Think about yourself: if you were given a new word, how well could you work out its meaning? You would probably look at how it is used in the sentence and rely on what you already understand, make an educated guess, and only over time come to understand the meaning.

Upvotes: 2

Fred Foo

Reputation: 363597

Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that does this for you, but there are ways to obtain bigger corpora:

Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK.

As for the unseen words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold by a special token like <unseen> prior to training. That will make your evaluation a bit harder.
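As an illustration of the thresholding trick, here is a minimal sketch. The threshold of 2 and the helper names `replace_rare` and `map_unknown` are assumptions made for the example, not part of any particular library:

```python
from collections import Counter

UNSEEN = "<unseen>"


def replace_rare(tokens, min_count=2):
    """Replace every word occurring fewer than min_count times with UNSEEN.

    Run this on the training corpus before counting bigrams.
    """
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else UNSEEN for t in tokens]


def map_unknown(tokens, vocabulary):
    """At prediction time, map words outside the training vocabulary to UNSEEN."""
    return [t if t in vocabulary else UNSEEN for t in tokens]


train = ["an", "excellent", "idea", "an", "excellent", "plan"]
train = replace_rare(train, min_count=2)
vocabulary = set(train)
print(train)
print(map_unknown(["an", "excellent", "proposal"], vocabulary))
```

The bigram model then learns statistics for `<unseen>` like any other token, so a word it has never encountered still gets a prediction.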

Upvotes: 3
