Reputation: 33
I'm stuck and need a little guidance. I'm trying hard to learn Python on my own using Grok Learning. Below are the problem and example output, along with where I am in the code. I'd appreciate any tips that will help me solve this problem.
In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.
Write a program to read in multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count up how many times each of the bigrams occur across all input sentences. The bigrams should be treated in a case insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example:
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line:
near the: 2
red ball: 2
the big: 3
big red: 3
I haven't gotten very far with my code and am really stuck. But here is where I am:
words = set()
line = input("Line: ")
while line != '':
    words.add(line)
    line = input("Line: ")
Am I even doing this right? Try not to import any modules and just use built-in functionality.
Thanks, Jeff
Upvotes: 3
Views: 4375
Reputation: 1
Using only knowledge learnt from previous modules of the Grok Learning Python Course mentioned by the OP, this code works fine in performing what was required:
counts = {} # this creates a dictionary for the bigrams and the tally for each one
n = 2
a = input('Line: ').lower().split() # the input is converted into lowercase, then split into a list
while a:
    for x in range(n, len(a)+1):
        b = tuple(a[x-2:x]) # the input gets sliced into pairs of two words (bigrams)
        counts[b] = counts.get(b,0) + 1 # adding the bigrams as keys to the dictionary, with their count value set to 1 initially, then increased by 1 thereafter
    a = input('Line: ').lower().split()
for c in counts:
    if counts[c] > 1: # tests if the bigram occurs more than once
        print(' '.join(c) + ':', counts[c]) # prints the bigram (making sure to convert the key from a tuple into a string), with the count next to it
Note: You may need to scroll right to fully view the notes made on code.
It is pretty straightforward and doesn't require importing anything, etc. I realise I am quite late to the conversation, but hopefully anyone else doing the same course / struggling with similar problems will find this answer helpful.
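For anyone who wants to check the tallying logic without typing at the prompt, here is a minimal sketch of the same approach wrapped in a function. The name `count_bigrams` and the hard-coded list of lines are my own assumptions for illustration, not part of the course.

```python
def count_bigrams(lines):
    """Tally adjacent word pairs across all lines (hypothetical helper)."""
    counts = {}
    for line in lines:
        words = line.lower().split()
        # range(len(words) - 1) is empty for lines with fewer than two words
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            # get() returns 0 the first time a pair is seen
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# sample input taken from the exercise text
sample = [
    "The big red ball",
    "The big red ball is near the big red box",
    "I am near the box",
]
counts = count_bigrams(sample)
for pair in counts:
    if counts[pair] > 1:
        print(' '.join(pair) + ':', counts[pair])
```

Keeping the counting in a function that takes a list makes it easy to try different inputs before wiring it back up to `input()`.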
Upvotes: 0
Reputation: 18707
Let's start with the function that receives a sentence (with punctuation) and returns a list of all lowercase bigrams found.
So, we first need to strip all non-alphanumerics from the sentence, convert all letters to lowercase counterparts, and then split the sentence by spaces into a list of words:
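import re

def bigrams(sentence):
    # raw string, so that \W reaches the regex engine unchanged
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # in Python 3, zip returns an iterator; list() makes the result printable
    return list(zip(words, words[1:]))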
We use the standard-library re module for the regular-expression substitution of non-word characters with spaces, and the built-in zip function to pair up consecutive words. (We pair the list of words with the same list shifted by one element.)
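To make the shifted-list pairing concrete, here is a small standalone sketch. The variable names `shifted` and `pairs` are only for illustration.

```python
words = ["the", "big", "red", "ball"]
shifted = words[1:]  # ['big', 'red', 'ball']

# zip stops at the shorter sequence, so the last word gets no partner
pairs = list(zip(words, shifted))
print(pairs)
```

A sentence with a single word yields no pairs at all, because the shifted list is empty.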
Now we can test it:
>>> bigrams("The big red ball")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams("THE big, red, ball.")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams(" THE big,red,ball!!?")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
Next, for counting the bigrams found in each sentence, you might use collections.Counter. For example, like this:
from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))
We get:
>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('red', 'box'): 1, ('i', 'am'): 1, ('the', 'box'): 1, ('ball', 'is'): 1, ('am', 'near'): 1, ('is', 'near'): 1})
Now we just need to print those that appear more than once:
for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))
All put together, with a loop for user input, instead of the fixed list:
import re
from collections import Counter

def bigrams(sentence):
    # raw string, so that \W reaches the regex engine unchanged
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    return list(zip(words, words[1:]))

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))
The output:
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line:
near the: 2
red ball: 2
big red: 3
the big: 3
Upvotes: 5
Reputation: 1210
usr_input = "Here is a sentence without multiple bigrams. Without multiple bigrams, we cannot test a sentence."

def get_bigrams(word_string):
    words = [word.lower().strip(',.') for word in word_string.split(" ")]
    pairs = ["{} {}".format(w, words[i+1]) for i, w in enumerate(words) if i < len(words) - 1]
    bigrams = {}
    for bg in pairs:
        if bg not in bigrams:
            bigrams[bg] = 0
        bigrams[bg] += 1
    return bigrams

print(get_bigrams(usr_input))
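The function above returns every bigram with its count, while the exercise asks only for those seen more than once. Here is a minimal sketch of that last step, assuming a dictionary shaped like the one get_bigrams returns. The name `repeated` and the sample dictionary are made up for illustration.

```python
def repeated(bigram_counts):
    """Keep only the entries whose count is greater than 1 (hypothetical helper)."""
    return {bg: n for bg, n in bigram_counts.items() if n > 1}


# hand-written sample in the same "word word" -> count shape
sample_counts = {"the big": 3, "big red": 3, "red box": 1}
for bg, n in repeated(sample_counts).items():
    print("{}: {}".format(bg, n))
```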
Upvotes: 0
Reputation: 114035
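counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # pair each word with its successor; pairing per line keeps bigrams from spanning two sentences
    for t in zip(words, words[1:]):
        if t not in counts:
            counts[t] = 0
        counts[t] += 1

# print only the bigrams seen more than once
for t, n in counts.items():
    if n > 1:
        print(" ".join(t) + ":", n)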
Upvotes: 2