Martha Pears

Reputation: 149

How to find log probability of bigrams using python?

I just created a bigram model in python as follows:

import re
from collections import Counter

class BiGramModel:
    def __init__(self, trainfiles):
        wordlist = re.findall(r'\b\w+\b', trainfiles)
        wordlist = Counter(wordlist)
        for word, count in wordlist.items():
            if count == 1:
                trainfiles = re.sub(r'\b{}\b'.format(word), '<unk>', trainfiles)
        print trainfiles
        m = re.findall(r'(?=(\b\w+\b \S+))', trainfiles)
        print m

    def logprob(self, context, word):
        pass  # Need to write code for this

def main():
    bi = BiGramModel("STOP to be or not to be STOP")
    bi.score("STOP to be or not to be STOP")

main()

Now I need to calculate the log probability, logprob(self, context, word), for each context and word entered. I know that the formula for the log probability is log((count(event, context) + 1) / (count(context) + V)).

For example:

for the sentence: `STOP to be <unk> <unk> to be STOP`

bigrams printed out: ['STOP to', 'to be', 'be <unk>', 'to be', 'be STOP']

Now if I do logprob(STOP, STOP), then count(STOP, STOP) from the above list is 0 and V, the number of distinct words in the sentence, is 5, hence log prob = log((0+1)/(0+5)) = log(1/5).

I am not able to figure out how to write a separate function for this such that it gets bigrams from the above init function. I'm stuck on this!

Upvotes: 2

Views: 3814

Answers (1)

erewok

Reputation: 7835

> I am not able to figure out how to write a separate function for this such that it gets bigrams from the above init function.

__init__ is the constructor for your class. Your class creates objects (it "instantiates" them) and __init__ defines what happens when those objects are created.

The idea of a class is that it sets out the blueprint for an object that contains some things (normally called 'state') and it also does some things ('behavior'). The way to make the object contain things is often to set those things as attributes when you are creating objects.
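As a minimal illustration of state and behavior, here is a small sketch. The `WordBag` class and its `words` and `size` names are hypothetical and are not part of your code; `print()` with a single argument is used so that it runs on both Python 2 and Python 3.

```python
class WordBag(object):
    def __init__(self, text):
        # State: the words are stored on the object when it is created.
        self.words = text.split()

    def size(self):
        # Behavior: a method that reads the stored state through self.
        return len(self.words)


bag = WordBag("to be or not to be")
print(bag.size())
```

The method never receives the text as an argument; it finds it on `self`, which is the same mechanism your `logprob` method can use.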

One way you can set attributes is by using that variable self, which you created in the definition of the constructor. For instance, you could save all of those things in __init__ so that your class methods could access them in the following way:

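def __init__(self, trainfiles):
    wordlist = re.findall(r'\b\w+\b', trainfiles)
    self.wordlist = Counter(wordlist)
    # Start from the raw text so the attribute exists even if no word is rare.
    self.trainfiles = trainfiles
    for word, count in self.wordlist.items():
        if count == 1:
            # Substitute into self.trainfiles so that each replacement accumulates.
            self.trainfiles = re.sub(r'\b{}\b'.format(word), '<unk>', self.trainfiles)
    print self.trainfiles
    self.m = re.findall(r'(?=(\b\w+\b \S+))', self.trainfiles)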

This means that any method called after __init__ (which, practically speaking, means every method you define for your class) will have access to these attributes. You could then do, for instance:

def logprob(self,context, word):
    print self.m 
    print self.trainfiles

The design idea you are looking for, though, is not totally clear and that's why it's important to think about each object as a collection of state and behavior.

The real question is: what does a BiGramModel really do and what should each BiGramModel hold?

A supplemental question can help inform the answer to the first: are you creating many of these things or just one of them?

When you can answer that, you'll know what to save as attributes for the object and which methods to write.

More specific to your question, you mentioned that you know the formula you want:

log((count(event, context) + 1) / (count(context) + V))

It's not clear where these variables are going to come from, though. How are you calculating count? Are you passing that in? What's event?
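To make those questions concrete, here is one possible sketch, not necessarily what you intend. It assumes that count(event, context) is the number of times the pair (context, word) occurs as adjacent tokens, that count(context) is the number of times context occurs as a token, and that V is the number of distinct tokens after the `<unk>` replacement. It also assumes plain whitespace tokenization instead of your regular expressions, and the attribute names `unigrams`, `bigrams` and `vocab_size` are made up for illustration.

```python
import math
from collections import Counter


class BiGramModel(object):
    def __init__(self, text):
        # Assumption: tokens are separated by whitespace.
        tokens = text.split()
        counts = Counter(tokens)
        # Replace tokens seen only once with <unk>, as in the question.
        tokens = [t if counts[t] > 1 else '<unk>' for t in tokens]
        # Store everything logprob needs as attributes.
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, context, word):
        # Add-one smoothing: (count(context, word) + 1) / (count(context) + V).
        # A Counter returns 0 for keys it has never seen.
        numerator = self.bigrams[(context, word)] + 1
        denominator = self.unigrams[context] + self.vocab_size
        return math.log(float(numerator) / denominator)


model = BiGramModel("STOP to be or not to be STOP")
print(model.logprob("STOP", "to"))
print(model.logprob("STOP", "STOP"))
```

Under these assumptions V is 4 (STOP, to, be, `<unk>`) and count(STOP) is 2, so `logprob("STOP", "STOP")` evaluates to log(1/6). If you define V or count(context) differently, only the lines that build those attributes in `__init__` need to change.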

Upvotes: 3
