Reputation: 502
I am having trouble loading this JSON into a DataFrame. I suspect the problem is how I format the JSON when I write each tweet, but I have not been able to narrow it down. Any insight would be awesome. Attached is my raw_tweets.json, and the second code block below shows how I write it, separating the tweets with commas, i.e. ','.join(...).
Here is the link to raw_tweets.json
I get this error:
raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
#JSON to DataFrame
class tweet2dframe(object):
    def __init__(self, text="", location=""):
        self.text = text
        self.location = location
    def getText(self):
        return self.text
    def getLocation(self):
        return self.location
# import json package to load json file
with open('raw_tweets.json',encoding="utf8") as jsonFile:
    polls_json = json.loads(jsonFile.read())
tweets_list = [polls(i["location"], i["text"]) for i in polls_json['text']]
colNames = ("Text", "location")
dict_list = []
for i in tweets_list:
    dict_list.append(dict(zip(colNames , [i.getText(), i.getLocation()])))
tweets_df = pd.DataFrame(dict_list)
tweets_df.head()
This is how I write my tweets to JSON:
saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(','.join(self.tweet_data))
saveFile.close()
exit()
Upvotes: 0
Views: 2421
Reputation: 880877
raw_tweets.json contains invalid JSON. It contains JSON snippets separated by commas. To make the whole text a valid JSON array, place brackets [...] around the contents:
with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))
For example,
import pandas as pd
import json
with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))
tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]
colNames = ("location", "text")
tweets_df = pd.DataFrame(tweets_list, columns=colNames)
print(tweets_df.head())
yields
        location                                               text
0           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
1  Pittsburgh PA  "The tuxedo was an invention of the Koch broth...
2           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
3           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
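The "Extra data" error and the bracket fix can be reproduced without the real file. The following is a minimal sketch that uses a made-up, in-memory string standing in for the contents of raw_tweets.json (the tweet texts are invented for illustration):

```python
import json

# Two JSON objects joined by a comma, mimicking ','.join(self.tweet_data).
# The tweet contents are hypothetical.
raw = '{"text": "first tweet"},{"text": "second tweet"}'

try:
    json.loads(raw)
    error_message = None
except json.JSONDecodeError as err:
    # The parser reads the first object, then finds unexpected trailing text.
    error_message = err.msg

# Wrapping the contents in brackets turns them into a single JSON array.
tweets = json.loads('[{}]'.format(raw))

print(error_message)
print(len(tweets))
```

The parser stops after the first complete object, which is why everything after the first comma is reported as extra data.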
Another, better way to fix the problem would be to write valid JSON to raw_tweets.json. After all, if you wanted to send the file to someone else, you would make their life easier if the file contained valid JSON. We'd need to see more of your code to suggest exactly how to fix it, but in general you would want to use json.dump to write a list of dicts as JSON to a file instead of "manually" writing JSON snippets with saveFile.write(','.join(self.tweet_data)):
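import io
import json

tweets = []
for i in loop:  # placeholder for however the tweets are collected
    tweets.append(tweet_dict)  # tweet_dict: one tweet as a Python dict
with io.open('raw_tweets.json', 'w', encoding='utf-8') as saveFile:
    json.dump(tweets, saveFile)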
If raw_tweets.json contained valid JSON, then you could load it into a Python list of dicts using:
with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.load(jsonFile)
The rest of the code, which loads the desired parts into a DataFrame, would remain the same.
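Here is a rough sketch of the write-then-read round trip with json.dump and json.load. It assumes that self.tweet_data holds JSON strings (which the ','.join call suggests, though the question does not show it), so each string is parsed into a dict first. The tweet_strings list and the temporary file path are made up for illustration:

```python
import io
import json
import os
import tempfile

# Hypothetical stand-in for self.tweet_data: one JSON string per tweet.
tweet_strings = [
    '{"text": "first tweet", "user": {"location": "Pittsburgh PA"}}',
    '{"text": "second tweet", "user": {"location": null}}',
]

# Parse each string so that we hold a list of dicts, not a list of strings.
tweets = [json.loads(s) for s in tweet_strings]

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "raw_tweets.json")

with io.open(path, "w", encoding="utf-8") as save_file:
    json.dump(tweets, save_file)  # writes one valid JSON array

with io.open(path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)  # no bracket wrapping needed

print(loaded == tweets)
```

If the collected items are already dicts, the json.loads step can be dropped and the list passed to json.dump directly.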
How was this line of code constructed:
tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]
In an interactive Python session I inspected one dict in polls_json:
In [114]: import pandas as pd
In [115]: import json
In [116]: with open('raw_tweets.json', encoding="utf8") as jsonFile:
     ...:     polls_json = json.loads('[{}]'.format(jsonFile.read()))
In [117]: dct = polls_json[1]
In [118]: dct
Out[118]:
{'contributors': None,
 'coordinates': None,
 ...
 'text': "Like the old Soviet leaders, Bernie refused to wear a tux at last night's black-tie dinner.",
 'truncated': False,
 'user': {'contributors_enabled': False,
          ...
          'location': 'Washington DC',}}
It is quite large, so I've omitted parts of it here to make the result more readable.
Assuming that I correctly guessed the text and location values you are looking for, we can see that, given this dict, dct, we can access the desired text value using dct['text']. But the 'location' key is inside the nested dict, dct['user']. Therefore, we need to use dct['user']['location'] to extract the location value.
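If some tweets might lack a 'user' or 'location' key, direct indexing would raise a KeyError. The sketch below shows one defensive variant using dict.get; the helper name extract_location_and_text and the sample dicts are hypothetical, not part of the original code:

```python
def extract_location_and_text(dct):
    """Return (location, text) for one tweet dict, with None for missing values."""
    # Fall back to an empty dict if 'user' is absent or None.
    user = dct.get("user") or {}
    return (user.get("location"), dct.get("text"))


# Made-up tweet dicts covering the nested, null, and missing cases.
sample = [
    {"text": "tweet one", "user": {"location": "Washington DC"}},
    {"text": "tweet two", "user": {"location": None}},
    {"text": "tweet three"},
]

tweets_list = [extract_location_and_text(dct) for dct in sample]
print(tweets_list)
```

The resulting list of tuples can be passed to pd.DataFrame with columns=("location", "text"), just as in the code above.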
By the way, Pandas provides a convenient method for reading JSON into a DataFrame, pd.read_json, but it relies on the JSON data being "flat". Because the data we desire is in nested dicts, I used custom code, the list comprehension
tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]
to extract the values instead of pd.read_json.
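As a possible alternative to the list comprehension, pandas also offers pd.json_normalize (a top-level function in pandas 1.0 and later), which flattens nested dicts into dotted column names. This is only a sketch on made-up data, not something tested against the real raw_tweets.json:

```python
import pandas as pd

# Hypothetical stand-in for polls_json: a list of nested tweet dicts.
polls_json = [
    {"text": "first tweet", "user": {"location": "Pittsburgh PA"}},
    {"text": "second tweet", "user": {"location": None}},
]

# Nested keys become dotted column names such as "user.location".
flat_df = pd.json_normalize(polls_json)

# Keep only the two columns of interest and give the location a shorter name.
tweets_df = flat_df[["user.location", "text"]].rename(
    columns={"user.location": "location"}
)

print(tweets_df.head())
```

Real tweet dicts have many nested fields, so flat_df would be wide; selecting the needed columns afterwards keeps the result comparable to the list-comprehension approach.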
Upvotes: 2