Reputation: 2417
I have some JSON Twitter data from the streaming API, and I would like to use Counter to get an idea of the most popular hashtags in this dataset. The issue I have is looping through tweets that contain more than one hashtag: I want to pull out every hashtag, not just the first one while ignoring the rest.
Question: how do I loop through a nested list inside of a dict to extract all hashtags in a tweet and not just the first hashtag?
In [1]: import json
In [2]: from collections import Counter
In [3]: data = []
In [4]: for line in open('DC.json'):
   ...:     try:
   ...:         data.append(json.loads(line))
   ...:     except:
   ...:         pass
   ...:
In [5]: hashtags = []
In [6]: for i in data:
   ...:     if 'entities' in i and len(i['entities']['hashtags']) > 0:
   ...:         hashtags.append(i['entities']['hashtags']['text'])
   ...:     else:
   ...:         pass
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-66d7538509f9> in <module>()
      1 for i in data:
      2     if 'entities' in i and len(i['entities']['hashtags']) > 0:
----> 3         hashtags.append(i['entities']['hashtags']['text'])
      4     else:
      5         pass

TypeError: list indices must be integers, not str
In [7]: Counter(hashtags).most_common()[:10]
Example of a tweet with 4 hashtags in i['entities']['hashtags']:
In [12]: i[0]['entities']['hashtags']
Out[12]:
[{u'indices': [28, 35], u'text': u'selfie'},
 {u'indices': [82, 92], u'text': u'omg'},
 {u'indices': [93, 104], u'text': u'Champ'},
 {u'indices': [105, 117], u'text': u'FIRST'}]
Upvotes: 1
Views: 4564
Reputation: 121975
You say that i['entities']['hashtags'] is a list of dicts, so the line:
hashtags.append(i['entities']['hashtags']['text'])
is trying to index a list with a string. Lists only accept integers or slices as indices, hence the TypeError. I think you would be better off splitting this into steps, first collecting all of the hashtag dictionaries:
hashtags = []
for i in data:
    if 'entities' in i:
        hashtags.extend(i['entities']['hashtags'])
then extracting the 'text' from each one:
hashtags = [tag['text'] for tag in hashtags]
then passing the result to Counter:
Counter(hashtags).most_common(10)
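If you prefer a single pass, the three steps can be folded into one generator expression fed straight to Counter. This is only a sketch: the data list below is a made-up stand-in for your parsed DC.json (the first entry reuses the four hashtags from your example, the rest are invented), and count_hashtags is a hypothetical helper name, not anything from the Twitter API.

```python
from collections import Counter

# Made-up stand-in for the parsed tweets. The first entry reuses the four
# hashtags shown in the question; the remaining entries are invented.
data = [
    {'entities': {'hashtags': [
        {'indices': [28, 35], 'text': 'selfie'},
        {'indices': [82, 92], 'text': 'omg'},
        {'indices': [93, 104], 'text': 'Champ'},
        {'indices': [105, 117], 'text': 'FIRST'},
    ]}},
    {'entities': {'hashtags': [{'indices': [0, 4], 'text': 'omg'}]}},
    {'entities': {'hashtags': []}},  # tweet without any hashtags
    {'limit': {'track': 1}},         # invented message with no 'entities' key
]


def count_hashtags(tweets):
    """Count the text of every hashtag in every tweet (hypothetical helper)."""
    return Counter(
        tag['text']
        for tweet in tweets
        # .get() with defaults skips messages that lack either key
        for tag in tweet.get('entities', {}).get('hashtags', [])
    )


counts = count_hashtags(data)
print(counts.most_common(10))
```

The nested for clauses do the same job as extend followed by the list comprehension, and the .get() defaults replace the explicit 'entities' in i check, so entries without hashtags simply contribute nothing.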
Upvotes: 4