Reputation: 211
I am trying to reproduce some common NLP metrics with my own code, including Manning and Schütze's t-test and chi-square test for collocational significance.
I call nltk.bigrams() on the following list of 24 tokens:
tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
'the', 'door','100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']
I get 23 bigrams:
[('she', 'knocked'), ('knocked', 'on'), ('on', 'his'), ('his', 'door'), ('door', 'she'),
('she', 'knocked'), ('knocked', 'at'), ('at', 'the'), ('the', 'door'), ('door', '100'),
('100', 'women'), ('women', 'knocked'), ('knocked', 'on'), ('on', "Donaldson's"),
("Donaldson's", 'door'), ('door', 'a'), ('a', 'man'), ('man', 'knocked'),
('knocked', 'on'), ('on', 'the'), ('the', 'metal'), ('metal', 'front'), ('front',
'door')]
If I want to determine the t statistic for ('she', 'knocked'), I input:
# Total bigrams is 23
t = (2/23.0 - (2/23.0) * (4/23.0)) / math.sqrt(2/23.0/23.0)
t = 1.16826337761
However, when I try:
finder = BigramCollocationFinder.from_words(tokens)
student_t = finder.score_ngrams(bigram_measures.student_t)
student_t = (('she', 'knocked'), 1.178511301977579)
When I turn the size of my bigram population to 24 (the length of the original list of tokens), I get the same answer as NLTK:
('she', 'knocked'): 1.17851130198
My question is really simple: what do I use as the population count for these hypothesis tests? The length of the tokenized list or the length of the bigram list? Or does the procedure count a terminal unit that nltk.bigrams() does not output?
Upvotes: 3
Views: 2098
Reputation: 122348
First we dig out score_ngram() from nltk.collocations.BigramCollocationFinder. See https://github.com/nltk/nltk/blob/develop/nltk/collocations.py:
def score_ngram(self, score_fn, w1, w2):
    """Returns the score for a given bigram using the given scoring
    function. Following Church and Hanks (1990), counts are scaled by
    a factor of 1/(window_size - 1).
    """
    n_all = self.word_fd.N()
    n_ii = self.ngram_fd[(w1, w2)] / (self.window_size - 1.0)
    if not n_ii:
        return
    n_ix = self.word_fd[w1]
    n_xi = self.word_fd[w2]
    return score_fn(n_ii, (n_ix, n_xi), n_all)
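To make the counts that score_ngram() hands to score_fn concrete, here is a minimal sketch that rebuilds them for ('she', 'knocked') with plain collections.Counter objects. The names word_counts and bigram_counts are stand-ins I made up for finder.word_fd and finder.ngram_fd; this is not NLTK's own code, just the same bookkeeping under the default window_size of 2.

```python
from collections import Counter

tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
          'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's",
          'door', 'a', 'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

# Hypothetical stand-ins for finder.word_fd and finder.ngram_fd.
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

w1, w2 = 'she', 'knocked'
n_all = sum(word_counts.values())   # total tokens, the role played by word_fd.N()
n_ii = bigram_counts[(w1, w2)]      # joint count; no scaling needed when window_size == 2
n_ix = word_counts[w1]              # count of the first word
n_xi = word_counts[w2]              # count of the second word
n_bigrams = sum(bigram_counts.values())

print(n_all, n_ii, n_ix, n_xi, n_bigrams)
```

Note that n_all comes from the word counts, so it is the number of tokens (24), while the bigram counter only holds 23 items.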
Then we take a look at student_t() from nltk.metrics.association, see https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py:
### Indices to marginals arguments:
NGRAM = 0
"""Marginals index for the ngram count"""
UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""
TOTAL = -1
"""Marginals index for the number of words in the data"""

def student_t(cls, *marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (cls._n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)
And _product() and _SMALL are:
_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20
So going back to your example:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
'the', 'door','100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']
finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
print(finder.word_fd.N())
student_t = dict(finder.score_ngrams(bigram_measures.student_t))
print(student_t['she', 'knocked'])
[out]:
24
1.17851130198
NLTK takes the number of tokens as the population count, i.e. 24. But I would say this is not how student_t scores are usually calculated. I would have gone with #Ngrams rather than #Tokens, see nlp.stanford.edu/fsnlp/promo/colloc.pdf and www.cse.unt.edu/~rada/CSCE5290/Lectures/Collocations.ppt . But since the population is a constant and #Tokens = #Ngrams + 1 for bigrams, I'm not sure the difference matters much once #Tokens is large.
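To see how much the choice of population count matters, here is a small sketch of the textbook-style t statistic with the population passed in explicitly. The helper name t_score is mine, and the "scaled" corpus at the end is a made-up illustration (the same counts multiplied by 1000), not real data.

```python
import math

def t_score(n_ii, n_ix, n_xi, population):
    """t statistic for a bigram, with the sample variance approximated by the sample mean."""
    n = float(population)
    x_bar = n_ii / n                  # observed bigram probability
    mu = (n_ix / n) * (n_xi / n)      # expected probability under independence
    return (x_bar - mu) / math.sqrt(x_bar / n)

# ('she', 'knocked'): bigram count 2, 'she' count 2, 'knocked' count 4
t_bigrams = t_score(2, 2, 4, 23)   # population = number of bigrams
t_tokens = t_score(2, 2, 4, 24)    # population = number of tokens (NLTK's choice)
print(t_bigrams, t_tokens)

# Hypothetical larger corpus: same proportions, counts scaled by 1000.
gap_small = abs(t_tokens - t_bigrams)
gap_large = abs(t_score(2000, 2000, 4000, 24000) - t_score(2000, 2000, 4000, 23999))
print(gap_small, gap_large)
```

With 23 as the population this reproduces the hand-computed 1.168..., and with 24 it reproduces NLTK's 1.178...; on the scaled-up counts the gap between the two conventions shrinks to well under 0.001.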
Let's keep digging into how NLTK calculates student_t.
If we pull student_t() out of its class and pass in the parameters directly, we get the same output:
from functools import reduce  # built in on Python 2, needs this import on Python 3

NGRAM = 0
"""Marginals index for the ngram count"""
UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""
TOTAL = -1
"""Marginals index for the number of words in the data"""

_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20

def student_t(*marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    _n = 2  # bigrams
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (_n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)

ngram_freq = 2         # count of ('she', 'knocked')
w1_freq = 2            # count of 'she'
w2_freq = 4            # count of 'knocked'
total_num_words = 24   # number of tokens, not number of bigrams
print(student_t(ngram_freq, (w1_freq, w2_freq), total_num_words))
So we see that in NLTK, the student_t score for bigrams is calculated as such:
import math
(2 - 2*4/float(24)) / math.sqrt(2 + 1e-20)
Or, as a formula:
(ngram_freq - (w1_freq * w2_freq) / total_num_words) / sqrt(ngram_freq + 1e-20)
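For anyone comparing this with the textbook form, here is a short derivation sketch showing that the count-based formula is the usual t statistic multiplied through by the population size. The symbols are my own shorthand: c_{12} is ngram_freq, c_1 and c_2 are w1_freq and w2_freq, N is total_num_words, and the variance is approximated by the sample mean as in Manning and Schütze 5.3.1. The _SMALL term is left out.

```latex
\begin{align*}
t &= \frac{\bar{x} - \mu}{\sqrt{s^2 / N}},
  \qquad \bar{x} = \frac{c_{12}}{N},
  \quad \mu = \frac{c_1}{N} \cdot \frac{c_2}{N},
  \quad s^2 \approx \bar{x} \\
  &= \frac{\dfrac{c_{12}}{N} - \dfrac{c_1 c_2}{N^2}}{\sqrt{c_{12} / N^2}}
   = \frac{c_{12} - \dfrac{c_1 c_2}{N}}{\sqrt{c_{12}}}
\end{align*}
```

So the two ways of writing it agree; the only thing that differs between the hand calculation and NLTK is whether N is 23 (bigrams) or 24 (tokens).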
Upvotes: 6