user1971414

Reputation: 61

Words generated from Text.similar() and ContextIndex.similar_words() in NLTK sorted by frequency?

I'm using these two functions to find similar words, and they return different lists. Are the results sorted from most to least frequent association?

Upvotes: 5

Views: 6233

Answers (1)

Richard

Reputation: 362

ContextIndex.similar_words(word) calculates a similarity score for each candidate word as the sum, over every context shared with word, of the product of the two words' frequencies in that context. Text.similar(word) simply counts the number of unique contexts each candidate shares with word.
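
To make the difference concrete, here is a toy sketch (plain Python with a made-up context-to-word frequency table, not NLTK's actual data structures) that scores candidates both ways:

from collections import defaultdict

# Hypothetical counts: context -> {word: frequency of that word in the context}
context_to_words = {
    ('a', 'who'):  {'woman': 9, 'man': 39, 'girl': 6, 'writer': 4},
    ('a', 'and'):  {'woman': 6, 'man': 11, 'girl': 1},
    ('the', 'is'): {'woman': 2, 'girl': 5},
}

query = 'woman'
product_scores = defaultdict(int)   # ContextIndex.similar_words() style
context_counts = defaultdict(int)   # Text.similar() style

for c, words in context_to_words.items():
    if query not in words:
        continue
    for w, freq in words.items():
        if w == query:
            continue
        product_scores[w] += words[query] * freq   # sum of products of frequencies
        context_counts[w] += 1                     # number of shared contexts

print(sorted(product_scores.items(), key=lambda kv: kv[1], reverse=True))
# [('man', 417), ('girl', 70), ('writer', 36)]
print(sorted(context_counts.items(), key=lambda kv: kv[1], reverse=True))
# [('girl', 3), ('man', 2), ('writer', 1)]

The two rankings can disagree ('man' wins on the product score, 'girl' on the context count), which is why the two NLTK functions return different lists.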

similar_words() seems to contain a bug in NLTK 2.0. See the definition in nltk/text.py:

def similar_words(self, word, n=20):
    scores = defaultdict(int)
    for c in self._word_to_contexts[self._key(word)]:
        for w in self._context_to_words[c]:
            if w != word:
                # print each candidate word, the shared context, and the two frequencies
                print w, c, self._context_to_words[c][word], self._context_to_words[c][w]
                # accumulate the product of the two words' frequencies in this context
                scores[w] += self._context_to_words[c][word] * self._context_to_words[c][w]
    return sorted(scores, key=scores.get)[:n]    # ascending sort: this is the bug

The returned word list should be sorted in descending order of similarity score, but sorted() here sorts ascending, so the least similar words are returned. Replace the return statement with:

return sorted(scores, key=scores.get)[::-1][:n]
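
Equivalently, and perhaps more readably, the descending sort can be written with reverse=True:

return sorted(scores, key=scores.get, reverse=True)[:n]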

In similar(), the call to similar_words() is commented out, perhaps due to this bug.

def similar(self, word, num=20):
    if '_word_context_index' not in self.__dict__:
        print 'Building word-context index...'
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

#   words = self._word_context_index.similar_words(word, num)

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        # tally, for each other word w, how many of the query word's contexts w also appears in
        fd = FreqDist(w for w in wci.conditions() for c in wci[w]
                      if c in contexts and not w == word)
        words = fd.keys()[:num]
        print tokenwrap(words)
    else:
        print "No matches"

Note: in NLTK 2.x, unlike a plain dict, FreqDist.keys() returns the samples sorted by decreasing frequency, so words is already ordered from most to least shared contexts.
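
A quick check of this (the behaviour is version-dependent: it holds in NLTK 2.x, whereas in NLTK 3 FreqDist subclasses collections.Counter, so keys() is no longer frequency-sorted and most_common() gives the sorted view):

from nltk import FreqDist

fd = FreqDist('abracadabra')
# NLTK 2.x: fd.keys() is already sorted by decreasing frequency, e.g. ['a', 'b', 'r', 'c', 'd']
# NLTK 3.x: FreqDist subclasses collections.Counter, so use fd.most_common() instead
print(sorted(fd.items(), key=lambda kv: kv[1], reverse=True))
# [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]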

Example:

import nltk

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

similar_words = text._word_context_index.similar_words('woman')
print ' '.join(similar_words)

Output:

man day time year car moment world family house boy child country
job state girl place war way case question   # Text.similar()

#man ('a', 'who') 9 39   # output from similar_words(); see following explanation
#girl ('a', 'who') 9 6
#[...]

man number time world fact end year state house way day use part
kind boy matter problem result girl group   # ContextIndex.similar_words()

fd, the frequency distribution built in similar(), tallies the number of contexts each word shares with 'woman':

fd = [('man', 52), ('day', 30), ('time', 30), ('year', 28), ('car', 24), ('moment', 24), ('world', 23) ...]
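
To verify one of these counts, the context sets can be intersected directly (a sketch assuming the text object and index built in the example above):

wci = text._word_context_index._word_to_contexts
shared = set(wci['woman']) & set(wci['man'])
print(len(shared))   # number of distinct contexts 'man' shares with 'woman'; should match the 52 above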

For each word appearing in each of those contexts, similar_words() accumulates the product of the two words' frequencies:

man ('a', 'who') 9 39  # 'a man who' occurs 39 times in text;
                       # 'a woman who' occurs 9 times
                       # Similarity score for the context is the product:
                       #     score['man'] = 9 * 39
girl ('a', 'who') 9 6
writer ('a', 'who') 9 4
boy ('a', 'who') 9 3
child ('a', 'who') 9 2
dealer ('a', 'who') 9 2
...
man ('a', 'and') 6 11  # score += 6 * 11
...
man ('a', 'he') 4 6    # score += 4 * 6
...
[49 more occurrences of 'man']
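
The final score for 'man' can be recomputed directly from the index in the same way (again a sketch assuming the text object from the example above):

ctw = text._word_context_index._context_to_words
wtc = text._word_context_index._word_to_contexts
score = sum(ctw[c]['woman'] * ctw[c]['man']
            for c in wtc['woman']
            if 'man' in ctw[c])
print(score)   # total of the per-context products listed above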

Upvotes: 4
