Reputation: 1729
I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:
def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri
Now the fun part: generalizing to n-grams. The main problem with generalizing my approach is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can't figure out how.
Also, other implementations I'm looking at are taking an entirely different tack (no surprise), e.g. here and here, so I'm starting to wonder if I'm at a dead end.
Before I give up on this approach, I'm curious: 1) Is there a one-line or Pythonic way to build a list of arbitrary length n in this manner? 2) What are the downsides of approaching the problem this way?
Upvotes: 0
Views: 6450
Reputation: 306
Try this.
def get_ngrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams
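A quick check, as a sketch: the function is repeated here so the snippet runs on its own, and the sample sentence is made up for illustration. With n set to 1, 2, or 3 it should give the same lists as the question's unigrams, bigrams, and trigrams.

```python
def get_ngrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams

# Made-up sample input: a list of word tokens
tokens = "the quick brown fox".split()

print(get_ngrams(tokens, 1))  # [['the'], ['quick'], ['brown'], ['fox']]
print(get_ngrams(tokens, 2))  # [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
print(get_ngrams(tokens, 3))  # [['the', 'quick', 'brown'], ['quick', 'brown', 'fox']]
print(get_ngrams(tokens, 5))  # [] because n is larger than the input
```

Each n-gram is just the slice wordlist[i:i + n], which is what removes the need to build the inner list by hand.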
Upvotes: 0
Reputation: 1458
The following function should work for a general n-gram model.
def ngram(text, grams):
    # model will hold the n-grams, each a slice of text with length grams
    model = []
    # range() is empty when grams > len(text), so no short n-grams are added
    for count in range(len(text) - grams + 1):
        model.append(text[count:count + grams])
    return model
Upvotes: 3
Reputation: 563
As a convenient one-liner:
def retrieve_ngrams(txt, n):
    return [txt[i:i + n] for i in range(len(txt) - (n - 1))]
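One caveat with slicing: it relies on len() and index slices, so it only works on sequences such as lists or strings, not on generators or other one-pass iterables. If that matters, a sliding window is one option. This is only a sketch under that assumption; iter_ngrams is a name I made up, and it yields tuples rather than lists.

```python
from collections import deque
from itertools import islice

def iter_ngrams(tokens, n):
    """Yield n-grams as tuples from any iterable (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(tokens)
    # Fill the window with the first n tokens (fewer if the input is short)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for token in it:
        window.append(token)  # maxlen drops the oldest token automatically
        yield tuple(window)

# Works on a generator, which has no len() and cannot be sliced
words = (w for w in "a b c d".split())
print(list(iter_ngrams(words, 2)))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

For plain lists that fit in memory, the one-liner is simpler and probably the better choice.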
Upvotes: 1