Reputation: 27
I have been trying to write code that counts how many times each bigram appears in a string (a bigram is a pair of adjacent words, such as 'if you' or 'you don't'). I tried using .join on list slices, but it returns only one word instead of two.
I call .join inside a for loop that runs n-1 times (where n is the number of words), intending to join the words at positions n-1 and n with a space.
content_string = "This is a test to see whether or not this can effectively create bigrams"
words = content_string.lower()
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '()','-']
words = "".join(i if i not in punctuation else "" for i in words)
words = words.split()
n=1
number = len(words)-1
for n in range(number):
    print(" ".join(words[n-1:n]))
I expected this to produce bigrams, but the actual output contains only unigrams. Oddly, when I store the results in a dictionary with the bigram as the key and its count as the value, the key is still a unigram, but the value is twice what I get when simply counting unigrams. What are some options that do not require importing the NLTK library?
Upvotes: 0
Views: 318
Reputation: 61930
If you want to count the bigrams, I suggest using collections.Counter. Just change the last part of your code:
from collections import Counter
bigrams = Counter(zip(words, words[1:]))
print(bigrams)
Output
Counter({('this', 'is'): 1, ('is', 'a'): 1, ('a', 'test'): 1, ('test', 'to'): 1, ('to', 'see'): 1, ('see', 'whether'): 1, ('whether', 'or'): 1, ('or', 'not'): 1, ('not', 'this'): 1, ('this', 'can'): 1, ('can', 'effectively'): 1, ('effectively', 'create'): 1, ('create', 'bigrams'): 1})
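To see where those tuple keys come from, here is a minimal sketch of the pairing step on its own. The short `words` list is made up for illustration and is not the list from the question:

```python
# Illustrative list only; in the question, `words` comes from splitting the cleaned string.
words = ["this", "is", "a", "test"]

# words[1:] is the same list shifted left by one, so zip pairs each word with its successor.
# zip stops at the shorter input, so the last word never starts a pair.
pairs = list(zip(words, words[1:]))
print(pairs)  # [('this', 'is'), ('is', 'a'), ('a', 'test')]
```

A list of n words yields n-1 pairs, which is the number of bigrams you would expect.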
The key here is to get the bigrams by zipping words with itself shifted by 1 (zip(words, words[1:])). If you want the bigrams as strings rather than tuples, do:
bigrams = Counter(' '.join(bigram) for bigram in zip(words, words[1:]))
Output
Counter({'this is': 1, 'is a': 1, 'a test': 1, 'test to': 1, 'to see': 1, 'see whether': 1, 'whether or': 1, 'or not': 1, 'not this': 1, 'this can': 1, 'can effectively': 1, 'effectively create': 1, 'create bigrams': 1})
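As for why the loop in the question prints single words: the slice words[n-1:n] covers only index n-1, so it never holds more than one element (and at n=0 it is words[-1:0], which is empty). If you would rather keep the loop and a plain dictionary, one possible fix is to slice words[n:n+2] instead. The sketch below assumes that approach; `count_bigrams` and the sample sentence are hypothetical names chosen for illustration:

```python
def count_bigrams(words):
    """Count adjacent word pairs with a plain dict (hypothetical helper, no imports needed)."""
    counts = {}
    # Stop at len(words) - 1 so that index n + 1 always exists.
    for n in range(len(words) - 1):
        bigram = " ".join(words[n:n + 2])  # slice covers indices n and n + 1
        counts[bigram] = counts.get(bigram, 0) + 1
    return counts

# Made-up sample with a repeated bigram, to show the counting.
sample = "to be or not to be".split()
result = count_bigrams(sample)
print(result)  # {'to be': 2, 'be or': 1, 'or not': 1, 'not to': 1}
```

This should give the same counts as the Counter version with string keys; Counter just saves you the manual bookkeeping.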
Upvotes: 1