Node.JS

Reputation: 1582

New to nltk, having trouble with conditional frequency

I am very new to Python and NLTK (I started 2 hours ago). Here is what I am asked to do:

Write a function GetAmbiguousWords(corpus, N) that finds words in the corpus with more than N observed tags. This function should return a ConditionalFreqDist object where the conditions are the words and the frequency distribution indicates the tag frequencies for each word.

Here is what I have done so far:

from collections import defaultdict

from nltk import ConditionalFreqDist

def GetAmbiguousWords(corpus, number):
    conditional_frequency = ConditionalFreqDist()
    word_tag_dict = defaultdict(set)       # Maps each word to the set of tags seen with it
    for (word, tag) in corpus:
        word_tag_dict[word].add(tag)

    for taggedWord in word_tag_dict:
        if len(word_tag_dict[taggedWord]) >= number:
            condition = taggedWord
            conditional_frequency[condition] # do something, I don't know what to do

    return conditional_frequency

For example, this is how the function should be called:

GetAmbiguousWords(nltk.corpus.brown.tagged_words(categories='news'), 4)

I am wondering whether I am on the right track or completely off. In particular, I don't really understand conditional frequency.

Thanks in advance.

Upvotes: 0

Views: 5324

Answers (1)

D-rk

Reputation: 5919

With a frequency distribution, you can collect how frequently a word occurred in a text:

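from nltk import FreqDist, word_tokenize

text = "cow cat mouse cat tiger"

# word_tokenize needs NLTK's Punkt tokenizer data to be installed
fDist = FreqDist(word_tokenize(text))

# sorted() makes the print order deterministic (alphabetical)
for word in sorted(fDist):
    print("Frequency of", word, fDist.freq(word))

This will result in:

Frequency of cat 0.4
Frequency of cow 0.2
Frequency of mouse 0.2
Frequency of tiger 0.2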

A conditional frequency distribution works the same way, except that you add a condition under which the counts are grouped, so each condition gets its own frequency distribution. For example, grouping by word length:

from nltk import ConditionalFreqDist

cfdist = ConditionalFreqDist()

for word in word_tokenize(text):
    condition = len(word)
    cfdist[condition][word] += 1    # one FreqDist per word length

# sorted() makes the print order deterministic
for condition in sorted(cfdist):
    for word in sorted(cfdist[condition]):
        print("Cond. frequency of", word, cfdist[condition].freq(word), "[condition is word length =", condition, "]")

This will print:

Cond. frequency of cat 0.6666666666666666 [condition is word length = 3 ]
Cond. frequency of cow 0.3333333333333333 [condition is word length = 3 ]
Cond. frequency of mouse 0.5 [condition is word length = 5 ]
Cond. frequency of tiger 0.5 [condition is word length = 5 ]
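Applied to the question, the word would be the condition and the tag the sample being counted. Here is a minimal sketch of one way to do it, not necessarily the intended solution. It assumes the corpus is an iterable of (word, tag) pairs and reads "more than N observed tags" as a strict comparison. The toy_corpus list is made-up data for illustration only.

```python
from nltk import ConditionalFreqDist


def GetAmbiguousWords(corpus, N):
    """Return a ConditionalFreqDist holding only words seen with more than N distinct tags."""
    # Built from (condition, sample) pairs: each word gets its own FreqDist of tags.
    all_words = ConditionalFreqDist(corpus)

    ambiguous = ConditionalFreqDist()
    for word in all_words.conditions():
        # len() of a FreqDist is the number of distinct samples, i.e. distinct tags here.
        if len(all_words[word]) > N:
            ambiguous[word] = all_words[word]
    return ambiguous


# Hypothetical toy data, standing in for something like brown.tagged_words()
toy_corpus = [
    ("run", "VB"), ("run", "NN"), ("run", "VB"),
    ("the", "AT"),
    ("book", "NN"), ("book", "VB"), ("book", "NN"),
]

result = GetAmbiguousWords(toy_corpus, 1)
print(sorted(result.conditions()))   # ['book', 'run']
print(result["run"].most_common())   # [('VB', 2), ('NN', 1)]
```

Counting every (word, tag) pair first and filtering afterwards means the tag frequencies are already in place for the words that are kept, so no separate dictionary of sets is needed.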

Hope that helps.

Upvotes: 6
