JeyuLeoChou

Reputation: 27

Counting bigrams in a string without using NLTK

I have been trying to write code that counts how many times each bigram appears in a string (in case you don't know, a bigram is a pair of consecutive words, such as 'if you' or 'you don't'). I tried using .join on list slices; however, it returns only one word instead of two.

I used .join inside a for loop that runs n-1 times (where n is the number of words), intending to join the words at positions n-1 and n with a space.

content_string = "This is a test to see whether or not this can effectively create bigrams"
words = content_string.lower()
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')', '-']
words = "".join(i if i not in punctuation else "" for i in words)
words = words.split()

n=1
number = len(words)-1
for n in range(number):
    print(" ".join(words[n-1:n]))

I expected bigrams to be produced, but the actual output contains only unigrams. (Funnily enough, when I use a dictionary with the bigram as the key and the number of times it appears as the value, the key is still a unigram, but the value is twice what I got when counting unigrams alone.) What are some options that do not require importing the NLTK library?

Upvotes: 0

Views: 318

Answers (1)

Dani Mesejo

Reputation: 61930

If you want to count the bigrams, I suggest using collections.Counter. Just replace the last part of your code (the loop) with:

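from collections import Counter

# Pair each word with the word that follows it, then count the pairs
bigrams = Counter(zip(words, words[1:]))
print(bigrams)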

Output

Counter({('this', 'is'): 1, ('is', 'a'): 1, ('a', 'test'): 1, ('test', 'to'): 1, ('to', 'see'): 1, ('see', 'whether'): 1, ('whether', 'or'): 1, ('or', 'not'): 1, ('not', 'this'): 1, ('this', 'can'): 1, ('can', 'effectively'): 1, ('effectively', 'create'): 1, ('create', 'bigrams'): 1})
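Every count above is 1 because no pair of words repeats in that sentence. As a quick check that the counts do go above 1 when a pair occurs more than once, here is a minimal self-contained sketch. The sample sentence is made up for illustration and is not from the question.

```python
from collections import Counter

# Hypothetical sample text, chosen so that one pair of words repeats.
text = "the cat sat on the mat and the cat slept"
words = text.lower().split()

# zip stops at the end of the shorter sequence, so this yields len(words) - 1 pairs.
bigrams = Counter(zip(words, words[1:]))

# most_common(1) returns a list with the single most frequent bigram and its count.
print(bigrams.most_common(1))
```

Here ('the', 'cat') appears twice, so it comes out on top.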

The key here is to get the bigrams by zipping words with itself shifted by 1 (zip(words, words[1:])). If you want the bigrams as strings rather than tuples, use:

bigrams = Counter(' '.join(bigram) for bigram in zip(words, words[1:]))
print(bigrams)

Output

Counter({'this is': 1, 'is a': 1, 'a test': 1, 'test to': 1, 'to see': 1, 'see whether': 1, 'whether or': 1, 'or not': 1, 'not this': 1, 'this can': 1, 'can effectively': 1, 'effectively create': 1, 'create bigrams': 1})
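As for why the original loop printed single words: the slice words[n-1:n] contains only the one element at index n-1 (and at n = 0 it is words[-1:0], which is empty), so .join never has two words to combine. If you would rather keep a plain loop and a dictionary, with no imports at all, a rough sketch could look like the following. The helper name count_bigrams and the sample sentence are my own, purely for illustration.

```python
def count_bigrams(words):
    """Count adjacent word pairs in a list of words using a plain dict."""
    counts = {}
    # Stop one short of the end so that words[n + 1] always exists.
    for n in range(len(words) - 1):
        # words[n:n + 2] holds two elements; words[n - 1:n] holds only one.
        bigram = " ".join(words[n:n + 2])
        counts[bigram] = counts.get(bigram, 0) + 1
    return counts


# Hypothetical sample input in which 'this is' occurs twice.
words = "this is a test and this is fine".split()
print(count_bigrams(words))
```

This should give the same counts as the Counter version; the only real difference is that you do the bookkeeping yourself with dict.get.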

Upvotes: 1
