Reputation: 847
I have a specific coding question in Python.
from collections import defaultdict

Count = defaultdict(int)
for l in text:
    # split each review on whitespace and count every word
    for m in l['reviews'].split():
        Count[m] += 1
print Count
text is a list that looks like the following:
[{'ideology': 3.4,
  'ID': '50555',
  'reviews': 'Politician from CA-21, very liberal and aggressive'},
 {'ideology': 1.5,
  'ID': '10223',
  'reviews': 'Retired politician'}, ...]
If I run this code, I get a result like this:
defaultdict(<type 'int'>, {'superficial,': 2, 'awesome': 1,
'interesting': 3, 'A92': 2, ....
What I want is a bigram count instead of a unigram count. I tried the following code, but I get the error TypeError: cannot concatenate 'str' and 'int' objects
Count = defaultdict(int)
for l in text:
    for m in l['reviews'].split():
        Count[m, m+1] += 1
I want to keep code similar to this rather than use the solutions that already exist on Stack Overflow. Most of those start from a word list, but I want to count bigrams directly from the split() of the original text.
I want a result like this:
defaultdict(<type 'int'>, {('superficial', 'awesome'): 1, ('awesome', 'interesting'): 1,
('interesting', 'A92'): 2, ....}
Why do I get an error and how do I fix this code?
Upvotes: 1
Views: 3384
Reputation: 444
The standard library has a class for counting hashable objects, called Counter (in the collections module).
Also, with the help of itertools, your bigram counter script can look like this:
from collections import Counter
from itertools import izip, tee  # Python 2; on Python 3 use the built-in zip

# function from the 'recipes' section of the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return izip(a, b)

text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'},
        {'ideology': 1.5, 'ID': '10223',
         'reviews': 'Retired politician'}]

c = Counter()
for l in text:
    # each review contributes its own adjacent word pairs
    c.update(pairwise(l['reviews'].split()))
print c.items()
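On Python 3, izip no longer exists and the built-in zip is already lazy, so the same idea can be sketched as follows. This is only a sketch assuming Python 3; bigram_counts is a hypothetical helper name that is not part of the script above.

```python
from collections import Counter
from itertools import tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


def bigram_counts(records):
    # Hypothetical helper: count adjacent word pairs within each review,
    # never pairing the last word of one review with the first of the next.
    counts = Counter()
    for record in records:
        counts.update(pairwise(record['reviews'].split()))
    return counts


text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'},
        {'ideology': 1.5, 'ID': '10223',
         'reviews': 'Retired politician'}]

c = bigram_counts(text)
print(c.most_common(3))
```

Counter also gives most_common() for free, which is handy once the corpus is large enough that only the frequent bigrams matter.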
Upvotes: 2
Reputation: 387
Do you want to count each pair of adjacent words? Make each pair a tuple.
text = [{'ideology': 3.4, 'ID': '50555', 'reviews': 'Politician from CA-21, very liberal and aggressive'}]
Count = {}
for l in text:
    words = l['reviews'].split()
    # stop one short of the end so words[i+1] always exists
    for i in range(len(words) - 1):
        if (words[i], words[i+1]) not in Count:
            Count[(words[i], words[i+1])] = 0
        Count[(words[i], words[i+1])] += 1
print Count
result:
{('and', 'aggressive'): 1, ('from', 'CA-21,'): 1, ('Politician', 'from'): 1, ('CA-21,', 'very'): 1, ('very', 'liberal'): 1, ('liberal', 'and'): 1}
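As for why the original attempt fails: m is a word (a str), so m+1 tries to add an integer to a string instead of moving on to the next word. A sketch that stays close to the question's defaultdict loop, assuming Python 3 and pairing each word with its successor via zip, might look like this:

```python
from collections import defaultdict

text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'},
        {'ideology': 1.5, 'ID': '10223',
         'reviews': 'Retired politician'}]

Count = defaultdict(int)
for l in text:
    words = l['reviews'].split()
    # zip stops at the shorter sequence, so the last word gets no partner
    for first, second in zip(words, words[1:]):
        Count[first, second] += 1

print(dict(Count))
```

The key Count[first, second] is the tuple (first, second), which is the key shape the question asks for.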
Upvotes: 0
Reputation: 3961
If I understand your question correctly, the code below solves your problem.
Count = dict()
for l in text:
    words = l['reviews'].split()
    for i in range(0, len(words) - 1):
        # join each word with its successor to form the bigram key
        bigram = " ".join(words[i:i+2])
        if bigram not in Count:
            Count[bigram] = 1
        else:
            Count[bigram] = Count[bigram] + 1
Count would be:
> {'CA-21, very': 1, 'liberal and': 1, 'very liberal': 1, 'and aggressive': 1, 'Politician from': 1, 'aggressive Politician': 1, 'from CA-21,': 1}
Edit: if you want the keys to be tuples instead, just change the join line to build a tuple. Tuples are hashable, so a Python dict accepts them as keys too.
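The tuple-key variant mentioned in the edit might look like the sketch below. It assumes Python 3, reuses the sample text from the question, and swaps the if/else for dict.get, which is a stylistic substitution rather than part of the code above.

```python
text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'},
        {'ideology': 1.5, 'ID': '10223',
         'reviews': 'Retired politician'}]

Count = {}
for l in text:
    words = l['reviews'].split()
    for i in range(len(words) - 1):
        # a two-word slice turned into a tuple, e.g. ('very', 'liberal')
        bigram = tuple(words[i:i + 2])
        # get() returns 0 for a bigram that has not been seen yet
        Count[bigram] = Count.get(bigram, 0) + 1

print(Count)
```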
Upvotes: 1