Reputation: 7338
Say I have a list of tuples, top_n, of the top n most common bigrams found in a corpus of text:
import nltk
from nltk import bigrams
from nltk import FreqDist
bi_grams = bigrams(text) # text is a list of strings (tokens)
fdistBigram = FreqDist(bi_grams)
n = 300
top_n= [list(t) for t in zip(*fdistBigram.most_common(n))][0]; top_n
>>> [('let', 'us'),
('us', 'know'),
('as', 'possible')
....
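As a side note, the zip(*...)[0] step can be written more directly. The sketch below is only an illustration and makes two assumptions: it uses collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist subclasses Counter, so most_common behaves the same way), and it uses a small made-up token list in place of the real corpus.

```python
from collections import Counter

# Hypothetical toy token list; in the question, `text` is the real corpus.
text = ['please', 'let', 'us', 'know', 'let', 'us', 'know',
        'as', 'soon', 'as', 'possible', 'let', 'us']

# Adjacent token pairs, the same pairs nltk.bigrams(text) would yield.
counts = Counter(zip(text, text[1:]))

n = 2
# most_common returns (bigram, count) pairs; keep only the bigram.
top_n = [bigram for bigram, _count in counts.most_common(n)]
print(top_n)  # [('let', 'us'), ('us', 'know')]
```

With the real FreqDist, the same comprehension over fdistBigram.most_common(n) should give the list of tuples directly.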
Now I want to replace instances of sets of words that are bigrams in top_n with their concatenation in place. For example, say we have a new variable query which is a list of strings:
query = ['please','let','us','know','as','soon','as','possible']
would become
['please','letus', 'usknow', 'as', 'soon', 'aspossible']
after the desired operation. More explicitly, I want to search every element of query and check if the ith and (i+1)th elements are in top_n; if they are, then replace query[i] and query[i+1] with a single concatenated bigram, i.e. (query[i], query[i+1]) -> query[i] + query[i+1].
Is there some way to do this using NLTK, or what would be the best way to do this if looping over each word in query is necessary?
Upvotes: 1
Views: 1691
Reputation: 7338
Alternative answer:
from gensim.models.phrases import Phraser
from gensim.models import Phrases
phrases = Phrases(text, min_count=1500, threshold=0.01)
bigram = Phraser(phrases)
bigram[query]
>>> ['please', 'let_us', 'know', 'as', 'soon', 'as', 'possible']
Not exactly the output desired in the question, but it works as an alternative. The inputs min_count and threshold will strongly influence the output. Thanks to this question here.
Upvotes: 0
Reputation: 20245
Given your code and the query, where words are greedily replaced with their bi-grams if they are in top_n, this will do the trick:
lookup = set(top_n) # {('let', 'us'), ('as', 'soon')}
query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']
answer = []
q_iter = iter(range(len(query)))
for idx in q_iter:
    answer.append(query[idx])
    if idx < (len(query) - 1) and (query[idx], query[idx+1]) in lookup:
        answer[-1] += query[idx+1]
        next(q_iter)
        # if you don't want to skip over consumed
        # second bi-gram elements and keep
        # len(query) == len(answer), don't advance
        # the iterator here, which also means you
        # don't have to create the iterator in outer scope
print(answer)
Results in (for example):
>>> ['please', 'letus', 'know', 'assoon', 'as', 'possible']
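The expected output in the question lets bigrams overlap ('letus' followed by 'usknow'), which the greedy loop above does not produce because it consumes the second word. A minimal sketch of one possible overlapping variant follows; the function name merge_overlapping is made up for illustration, and the rule it applies (emit every matching pair, drop tokens already covered by a match) is my reading of the example in the question, not something the asker stated.

```python
def merge_overlapping(query, lookup):
    """Concatenate every adjacent pair found in lookup; keep uncovered tokens."""
    answer = []
    prev_matched = False  # True if (previous word, current word) was in lookup
    for i, word in enumerate(query):
        next_matched = i + 1 < len(query) and (word, query[i + 1]) in lookup
        if next_matched:
            # Emit the concatenated bigram starting at this word.
            answer.append(word + query[i + 1])
        elif not prev_matched:
            # Word is not part of any matching bigram, keep it unchanged.
            answer.append(word)
        prev_matched = next_matched
    return answer


lookup = {('let', 'us'), ('us', 'know'), ('as', 'possible')}
query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']
print(merge_overlapping(query, lookup))
# ['please', 'letus', 'usknow', 'as', 'soon', 'aspossible']
```

Using a set for lookup keeps each membership test at constant time on average, so the whole pass stays linear in len(query).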
Upvotes: 2