Jeff Singleton

Reputation: 33

Counting bigrams from user input in python 3?

I'm stuck and need a little guidance. I'm trying hard to learn Python on my own using Grok Learning. Below is the Problem and example output along with where I am in the code. I appreciate any tips that will help me solve this problem.

In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.

Write a program to read in multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count up how many times each of the bigrams occurs across all input sentences. The bigrams should be treated in a case-insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example:

Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
the big: 3
big red: 3

I haven't gotten very far with my code and am really stuck. But here is where I am:

words = set()
line = input("Line: ")
while line != '':
  words.add(line)
  line = input("Line: ")

Am I even doing this right? Try not to import any modules and just use built-in functionality.

Thanks, Jeff

Upvotes: 3

Views: 4375

Answers (4)

C2H6O

Reputation: 1

Using only knowledge learnt from previous modules of the Grok Learning Python Course mentioned by the OP, this code works fine in performing what was required:

counts = {} # this creates a dictionary for the bigrams and the tally for each one
n = 2
a = input('Line: ').lower().split() # the input is converted into lowercase, then split into a list
while a:
  for x in range(n, len(a)+1):
    b = tuple(a[x-n:x]) # the input gets sliced into groups of n adjacent words (bigrams, since n is 2)
    counts[b] = counts.get(b,0) + 1 # adding the bigrams as keys to the dictionary, with their count value set to 1 initially, then increased by 1 thereafter
  a = input('Line: ').lower().split()  
for c in counts:
  if counts[c] > 1: # tests if the bigram occurs more than once
    print(' '.join(c) + ':', counts[c]) # prints the bigram (making sure to convert the key from a tuple into a string), with the count next to it

Note: You may need to scroll right to fully view the notes made on code.

It is pretty straightforward and doesn't require importing anything, etc. I realise I am quite late to the conversation, but hopefully anyone else doing the same course / struggling with similar problems will find this answer helpful.
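
For anyone who wants to check the counting logic without typing lines at the prompt, here is a minimal sketch of the same slicing idea wrapped in a function. The name count_bigrams and the list-of-lines argument are my own additions for illustration, not part of the course solution.

```python
def count_bigrams(lines):
    """Count adjacent word pairs across lines (hypothetical helper for testing)."""
    counts = {}
    for line in lines:
        words = line.lower().split()
        # Slice every window of two adjacent words: words[0:2], words[1:3], ...
        for end in range(2, len(words) + 1):
            pair = tuple(words[end - 2:end])
            counts[pair] = counts.get(pair, 0) + 1
    return counts


sample = [
    "The big red ball",
    "The big red ball is near the big red box",
    "I am near the box",
]
counts = count_bigrams(sample)
for pair in counts:
    if counts[pair] > 1:
        print(" ".join(pair) + ":", counts[pair])
```

Because the bigrams are sliced line by line, the last word of one sentence is never paired with the first word of the next.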

Upvotes: 0

randomir

Reputation: 18707

Let's start with the function that receives a sentence (with punctuation) and returns a list of all lowercase bigrams found.

So, we first need to replace all non-word characters (punctuation and the like) with spaces, convert all letters to lowercase, and then split the sentence on whitespace into a list of words:

import re

def bigrams(sentence):
    text = re.sub(r'\W', ' ', sentence.lower())  # raw string avoids an invalid-escape warning
    words = text.split()
    return zip(words, words[1:])

We'll use the re module from the standard library for regular-expression-based substitution of non-word characters with spaces, and the builtin zip function to pair up consecutive words. (We pair the list of words with the same list, but shifted by one element.)
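
To see the shifted pairing on its own, here is a small sketch with a hand-written word list (the list is just an illustration). In Python 3, zip returns a lazy iterator, so it is wrapped in list() to display the pairs.

```python
words = ["the", "big", "red", "ball"]

# words[1:] is the same list shifted left by one element, so zip lines up
# each word with its successor and stops when the shorter list runs out.
pairs = list(zip(words, words[1:]))
print(pairs)
# [('the', 'big'), ('big', 'red'), ('red', 'ball')]
```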

Now we can test it:

>>> list(bigrams("The big red ball"))
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> list(bigrams("THE big, red, ball."))
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> list(bigrams(" THE  big,red,ball!!?"))
[('the', 'big'), ('big', 'red'), ('red', 'ball')]

Next, for counting bigrams found in each sentence, you might use collections.Counter.

For example, like this:

from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))

We get:

>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('red', 'box'): 1, ('i', 'am'): 1, ('the', 'box'): 1, ('ball', 'is'): 1, ('am', 'near'): 1, ('is', 'near'): 1})

Now we just need to print those that appear more than once:

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

All put together, with a loop for user input, instead of the fixed list:

import re
from collections import Counter

def bigrams(sentence):
    text = re.sub(r'\W', ' ', sentence.lower())  # raw string avoids an invalid-escape warning
    words = text.split()
    return zip(words, words[1:])

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

The output:

Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
big red: 3
the big: 3

Upvotes: 5

A Magoon

Reputation: 1210

usr_input = "Here is a sentence without multiple bigrams. Without multiple bigrams, we cannot test a sentence."

def get_bigrams(word_string):
    words = [word.lower().strip(',.') for word in word_string.split(" ")]
    pairs = ["{} {}".format(w, words[i+1]) for i, w in enumerate(words) if i < len(words) - 1]
    bigrams = {}

    for bg in pairs:
        if bg not in bigrams:
            bigrams[bg] = 0
        bigrams[bg] += 1
    return bigrams

print(get_bigrams(usr_input))
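
The function above returns counts for every bigram, including those seen only once. To get the output the question asks for, one option is to filter the dictionary afterwards. A rough sketch, assuming the dictionary has string keys like 'big red' as returned by get_bigrams; the helper name repeated_bigrams and the sample dictionary are made up for illustration:

```python
def repeated_bigrams(bigram_counts):
    """Return 'bigram: count' lines for bigrams seen more than once (hypothetical helper)."""
    return ["{}: {}".format(bg, n) for bg, n in bigram_counts.items() if n > 1]


# Sample counts in the shape get_bigrams() returns: 'word word' -> tally.
sample_counts = {"the big": 3, "big red": 3, "red ball": 2, "red box": 1}
for line in repeated_bigrams(sample_counts):
    print(line)
```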

Upvotes: 0

inspectorG4dget

Reputation: 114035

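counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # Pair each word with its successor; pairing line by line keeps
    # bigrams from spanning two sentences.
    for t in zip(words, words[1:]):
        counts[t] = counts.get(t, 0) + 1

# Print only the bigrams that occur more than once.
for t, n in counts.items():
    if n > 1:
        print(" ".join(t) + ":", n)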
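
A quick way to sanity-check any pairing step is to run it on a short word list. The sketch below (the sample list is mine) compares two zip idioms: pairing a list with itself shifted by one gives overlapping pairs, which is what a bigram is, while pairing the even-indexed words with the odd-indexed words gives non-overlapping pairs and skips every second bigram.

```python
words = ["the", "big", "red", "ball", "is"]

# Overlapping pairs: each word is paired with its successor (true bigrams).
overlapping = list(zip(words, words[1:]))

# Stride-2 pairs: words at even indexes paired with words at odd indexes.
# This misses ('big', 'red') and ('ball', 'is').
non_overlapping = list(zip(words[::2], words[1::2]))

print(overlapping)
print(non_overlapping)
```

Also note that pooling every input line into one list before pairing would create a bigram from the last word of one line and the first word of the next, so pairing is safer done per line.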

Upvotes: 2
