CurtLH

Reputation: 2417

Count of all hashtags in a set of tweets

I have some JSON Twitter data from the streaming API, and I would like to use Counter to find the most popular hashtags in this dataset. My problem is handling tweets that have more than one hashtag: I want to collect every hashtag in the tweet, not just the first one.

Question: how do I loop through a nested list inside of a dict to extract all hashtags in a tweet and not just the first hashtag?

In [1]: import json

In [2]: from collections import Counter

In [3]: data = []

In [4]: for line in open('DC.json'):
   ...:     try:
   ...:         data.append(json.loads(line))
   ...:     except:
   ...:         pass
   ...:     

In [5]: hashtags = []

In [6]: for i in data:
   ...:     if 'entities' in i and len(i['entities']['hashtags']) > 0:
   ...:         hashtags.append(i['entities']['hashtags']['text'])
   ...:     else:
   ...:         pass
   ...:     
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-66d7538509f9> in <module>()
      1 for i in data:
      2     if 'entities' in i and len(i['entities']['hashtags']) > 0:
----> 3         hashtags.append(i['entities']['hashtags']['text'])
      4     else:
      5         pass

TypeError: list indices must be integers, not str

In [7]: Counter(hashtags).most_common()[:10]

Example with 4 hashtags in i['entities']['hashtags']

In [12]: i[0]['entities']['hashtags']
Out[12]: 
[{u'indices': [28, 35], u'text': u'selfie'},
 {u'indices': [82, 92], u'text': u'omg'},
 {u'indices': [93, 104], u'text': u'Champ'},
 {u'indices': [105, 117], u'text': u'FIRST'}]

Upvotes: 1

Views: 4564

Answers (1)

jonrsharpe

Reputation: 121975

You say that i['entities']['hashtags'] is a list of dicts, so the line:

hashtags.append(i['entities']['hashtags']['text'])

is trying to index a list using a string. Lists only accept integer or slice indices, which is why you get the TypeError. I think you would be better off splitting this into steps, first getting all of the hashtag dictionaries:

hashtags = []
for i in data:
    if 'entities' in i:
        hashtags.extend(i['entities']['hashtags'])

then extracting the 'text':

hashtags = [tag['text'] for tag in hashtags]

then dumping it into Counter:

Counter(hashtags).most_common(10)
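
The three steps can also be folded into one pass. The following is only a minimal sketch: the sample data and the helper name count_hashtags are made up for illustration, and the use of dict.get assumes you want to silently skip any record that lacks an 'entities' or 'hashtags' key.

```python
from collections import Counter

# Hypothetical sample shaped like the tweets in the question.
data = [
    {'entities': {'hashtags': [{'indices': [28, 35], 'text': 'selfie'},
                               {'indices': [82, 92], 'text': 'omg'}]}},
    {'entities': {'hashtags': [{'indices': [0, 4], 'text': 'omg'}]}},
    {'entities': {'hashtags': []}},  # tweet with no hashtags
    {'delete': {'status': {}}},      # record with no 'entities' key
]


def count_hashtags(tweets):
    """Count the text of every hashtag in every tweet (hypothetical helper)."""
    counts = Counter()
    for tweet in tweets:
        # Fall back to an empty list when either key is missing.
        tags = tweet.get('entities', {}).get('hashtags', [])
        # Each tag is a dict; count its 'text' value.
        counts.update(tag['text'] for tag in tags)
    return counts


top = count_hashtags(data).most_common(10)
print(top)  # [('omg', 2), ('selfie', 1)]
```

Updating the Counter directly avoids building the intermediate list of dictionaries, which may matter if the file of tweets is large.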

Upvotes: 4
