acpigeon

Reputation: 1729

Implementing ngrams in Python

I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:

def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri

Now the fun part: generalizing to n-grams. The main problem with generalizing the approach I have here is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can't figure out how.

Also, other implementations I'm looking at are taking an entirely different tack (no surprise), e.g. here and here, so I'm starting to wonder if I'm at a dead end.

Before I give up on this approach, I'm curious: 1) is there a one-line or Pythonic way to build a list of arbitrary size in this manner? 2) what are the downsides of approaching the problem this way?

Upvotes: 0

Views: 6450

Answers (3)

adikh

Reputation: 306

Try this.

def get_ngrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams
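
As a quick check, here is a minimal sketch of how the function behaves, assuming the input is an already tokenized list of words. The sample sentence and the variable names are made up for illustration; the function is repeated so the snippet runs on its own.

```python
def get_ngrams(wordlist, n):
    # One slice of length n per valid starting index.
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams

# Illustrative input, not from the question.
tokens = "the quick brown fox".split()

bigram_result = get_ngrams(tokens, 2)
trigram_result = get_ngrams(tokens, 3)
# When n exceeds the number of tokens, range() is empty, so nothing is returned.
too_long = get_ngrams(tokens, 5)

print(bigram_result)
print(trigram_result)
print(too_long)
```

Because each n-gram is just a slice, the list of length n never has to be built element by element, which sidesteps the problem described in the question.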

Upvotes: 0

jitendra

Reputation: 1458

The following function should work for a general n-gram model.

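def ngram(text, grams):
    # model collects the n-grams, each a slice of text of length grams
    model = []
    # iterating over start indices (rather than slicing text) avoids a
    # negative slice bound when grams is larger than len(text) + 1
    for start in range(len(text) - grams + 1):
        model.append(text[start:start + grams])
    return model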

Upvotes: 3

vermillon

Reputation: 563

As a convenient one-liner:

def retrieve_ngrams(txt, n):
    return [txt[i:i+n] for i in range(len(txt)-(n-1))]
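
If the n-grams will be counted afterwards, one possible variant is to zip n shifted copies of the token list, which produces tuples instead of lists. Tuples are hashable, so they can go straight into a collections.Counter. This is only a sketch: ngrams_zip and the sample tokens are hypothetical names chosen for illustration, and it assumes the input is a list (or other sliceable sequence) of tokens.

```python
from collections import Counter

def ngrams_zip(tokens, n):
    # Hypothetical helper: zip together n views of the list, each shifted
    # one position further; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

# Illustrative input, not from the question.
tokens = "a b a b a".split()

bigram_tuples = ngrams_zip(tokens, 2)
# Tuples are hashable, so they can be counted directly.
counts = Counter(bigram_tuples)

print(bigram_tuples)
print(counts)
```

The trade-off is that this makes n shallow copies of the list, so for very long inputs the slice-based one-liner is probably the lighter choice.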

Upvotes: 1
