user3077008

Reputation: 847

How to count bigrams using a loop in Python

I have a specific coding question about Python. The following code counts single words (unigrams):

from collections import defaultdict

Count = defaultdict(int)
for l in text:
    for m in l['reviews'].split():
        Count[m] += 1

print Count

Here, text is a list that looks like the following:

[{'ideology': 3.4,
  'ID': '50555',
  'reviews': 'Politician from CA-21, very liberal and aggressive'},
 {'ideology': 1.5,
  'ID': '10223',
  'reviews': 'Retired politician'}, ...]

If I run this code, I get a result like this:

defaultdict(<type 'int'>, {'superficial,': 2, 'awesome': 1, 
'interesting': 3, 'A92': 2, ....

What I want is a bigram count instead of a unigram count. I tried the following code, but it raises TypeError: cannot concatenate 'str' and 'int' objects

Count = defaultdict(int)
for l in text:
    for m in l['reviews'].split():
       Count[m, m+1] += 1

I would like to keep code similar to this rather than use the other solutions that already exist on Stack Overflow. Most of those work from a prepared word list, but I want to count bigrams directly from the output of split() on the original text.

I want to get a result similar to this:

defaultdict(<type 'int'>, {('superficial', 'awesome'): 1, ('awesome', 'interesting'): 1,
('interesting', 'A92'): 2, ....}

Why do I get an error and how do I fix this code?

Upvotes: 1

Views: 3384

Answers (3)

merletta

Reputation: 444

The standard library has a class for counting objects, collections.Counter. Combined with itertools, your bigram counter can look like this:

from collections import Counter
from itertools import izip, tee

# pairwise() is taken from the "Recipes" section of the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

text = [{'ideology': 3.4, 'ID': '50555',
 'reviews': 'Politician from CA-21, very liberal and aggressive'},
 {'ideology': 1.5, 'ID': '10223',
 'reviews': 'Retired politician'} ]

c = Counter()
for l in text:
   c.update(pairwise(l['reviews'].split()))

print c.items()
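
The snippet above targets Python 2 (izip and the print statement). A minimal Python 3 sketch of the same idea follows, assuming the built-in zip can stand in for izip. The helper name count_bigrams and its field parameter are introduced here for illustration only and do not come from the answer.

```python
from collections import Counter
from itertools import tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


def count_bigrams(records, field="reviews"):
    """Hypothetical helper: count adjacent word pairs within each record."""
    counts = Counter()
    for record in records:
        # Pairs are formed inside one review only, never across reviews.
        counts.update(pairwise(record[field].split()))
    return counts


text = [
    {"ideology": 3.4, "ID": "50555",
     "reviews": "Politician from CA-21, very liberal and aggressive"},
    {"ideology": 1.5, "ID": "10223", "reviews": "Retired politician"},
]

bigram_counts = count_bigrams(text)
print(bigram_counts.most_common(3))
```

On Python 3.10 and later, itertools.pairwise provides the same behavior as the recipe, so the local pairwise definition could be dropped.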

Upvotes: 2

W. Steve

Reputation: 387

Do you want to count each pair of adjacent words? Make each pair a tuple.

text = [{'ideology': 3.4, 'ID': '50555', 'reviews': 'Politician from CA-21, very liberal and aggressive'}]
Count = {}
for l in text:
    words = l['reviews'].split()
    for i in range(len(words) - 1):
        if (words[i], words[i+1]) not in Count:
            Count[(words[i], words[i+1])] = 0
        Count[(words[i], words[i+1])] += 1

print Count

Result:

{('and', 'aggressive'): 1, ('from', 'CA-21,'): 1, ('Politician', 'from'): 1, ('CA-21,', 'very'): 1, ('very', 'liberal'): 1, ('liberal', 'and'): 1}
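
As for the error in the question: in that loop m is a word (a string), so m+1 tries to add an integer to a string, which raises the TypeError. What is needed is the next word, not m plus one. A sketch that stays close to the question's defaultdict loop, written for Python 3 and assuming zip(words, words[1:]) is an acceptable way to pair each word with its successor (the names first and second are illustrative):

```python
from collections import defaultdict

text = [{"ideology": 3.4, "ID": "50555",
         "reviews": "Politician from CA-21, very liberal and aggressive"}]

count = defaultdict(int)
for l in text:
    words = l["reviews"].split()
    # zip stops at the shorter sequence, so the last word gets no partner.
    for first, second in zip(words, words[1:]):
        count[first, second] += 1

print(dict(count))
```

Slicing with words[1:] copies the list, which is fine for short reviews; for very long texts an iterator-based pairing avoids the copy.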

Upvotes: 0

RockOnGom

Reputation: 3961

If I understand your question correctly, the code below solves your problem.

Count = dict()
for l in text:
    words = l['reviews'].split()
    for i in range(len(words) - 1):
        # join each word with its successor to form the bigram key
        bigram = " ".join(words[i:i+2])
        if bigram not in Count:
            Count[bigram] = 1
        else:
            Count[bigram] += 1

Count would be:

> {'CA-21, very': 1, 'liberal and': 1, 'very liberal': 1, 'and
> aggressive': 1, 'Politician from': 1, 'aggressive Politician': 1,
> 'from CA-21,': 1}

Edit: if you want tuples as keys instead, just change the join line. Tuples are hashable, so they work as dict keys too.
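
One possible reading of that edit, sketched for Python 3: replace the join with tuple(...). The slice words[i:i+2] is a list, and lists are not hashable, so it has to be converted before it can serve as a key. The sample text is copied from the question; using dict.get for the update is a choice made here, not part of the answer.

```python
text = [
    {"ideology": 3.4, "ID": "50555",
     "reviews": "Politician from CA-21, very liberal and aggressive"},
    {"ideology": 1.5, "ID": "10223", "reviews": "Retired politician"},
]

count = {}
for l in text:
    words = l["reviews"].split()
    for i in range(len(words) - 1):
        bigram = tuple(words[i:i + 2])  # tuple key instead of a joined string
        count[bigram] = count.get(bigram, 0) + 1

print(count)
```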

Upvotes: 1
