Reputation: 303
Problem
I have one word and certain restrictions on what the next word might be (for example, "I _o__"). I want a list of matching words such as "rode", "love", and "most", along with how common each one is after "I".
I want to get a list of two-tuples (nextword, probability), where nextword is a word that matches a regex and probability is the chance that nextword follows the first word, given by (number of times nextword is seen directly after the first word in a corpus of text) / (number of times the first word appears).
Like this:
[(nextword, follow_probability("I", nextword)) for nextword in findwords('.o..')]
My approach is to first generate the list of words that satisfy the regex, and then look up the probability of each. The first part is easy, but I don't know how to do the second. Ideally I would have a function that takes the two words as arguments and returns the probability that the second follows the first.
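For reference, the first step could look roughly like the sketch below. It assumes the candidate words come from some vocabulary list, so `findwords` takes a second, hypothetical `vocabulary` argument that is not part of the call shown above; the sample words are made up for illustration.

```python
import re


def findwords(pattern, vocabulary):
    """Return the words in `vocabulary` that match `pattern` in full."""
    regex = re.compile(pattern)
    # fullmatch ensures '.o..' only accepts four-letter words, not longer ones.
    return [word for word in vocabulary if regex.fullmatch(word)]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["rode", "love", "most", "like", "go", "homes"]
matches = findwords(r".o..", vocabulary)
print(matches)
```

Using `fullmatch` rather than `match` or `search` keeps longer words such as "homes" out of the result.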
What I Have Tried
Upvotes: 0
Views: 1029
Reputation: 593
Try something like this:
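from collections import Counter, deque
from nltk.tokenize import regexp_tokenize
import pandas as pd

def grouper(iterable, length=2):
    # Yield overlapping windows of `length` consecutive items (bigrams by default).
    i = iter(iterable)
    q = deque(map(next, [i] * length))
    while True:
        yield tuple(q)
        try:
            q.append(next(i))
            q.popleft()
        except StopIteration:
            break

def tokenize(text):
    # Lowercase word tokens; punctuation is dropped.
    return [word.lower() for word in regexp_tokenize(text, r'\w+')]

def follow_probability(word1, word2, vec):
    # vec is a Series of bigram counts indexed by (first word, second word).
    try:
        subvec = vec.loc[word1]
    except KeyError:
        return 0.0  # word1 never starts a bigram in the corpus
    try:
        ct = subvec.loc[word2]
    except KeyError:
        ct = 0
    # The denominator is the number of bigrams that start with word1.
    return float(ct) / (subvec.sum() or 1)

text = 'This is some training text this this'
tokens = tokenize(text)
markov = list(grouper(tokens))
vec = pd.Series(Counter(markov))
follow_probability('this', 'is', vec)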
Output:
0.5
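To get the full list of (nextword, probability) tuples from the question, the lookup can be combined with the regex filter. The following is a minimal standard-library sketch, not a drop-in replacement for the pandas version: `build_follow_counts` and `ranked_followers` are hypothetical names, the sample sentence is made up, and only words actually seen after the first word are considered. Like the code above, it divides by the number of bigrams that start with the first word, which can differ by one from "number of times the first word appears" when that word is the last token of the corpus.

```python
import re
from collections import Counter, defaultdict


def build_follow_counts(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    follows = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        follows[first][second] += 1
    return follows


def ranked_followers(first, pattern, follows):
    """Return (nextword, probability) pairs for followers of `first` matching `pattern`."""
    counts = follows.get(first, Counter())
    total = sum(counts.values())  # number of bigrams starting with `first`
    if total == 0:
        return []
    regex = re.compile(pattern)
    pairs = [(word, n / total) for word, n in counts.items() if regex.fullmatch(word)]
    # Most likely followers first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical training text, for illustration only.
tokens = "i love it i rode home i love you i most".split()
follows = build_follow_counts(tokens)
result = ranked_followers("i", r".o..", follows)
print(result)
```

Words that match the regex but never follow the first word in the corpus are simply absent from the result, which is equivalent to a probability of 0.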
Upvotes: 2