Reputation: 97
I am using tweepy's StreamListener
to collect Twitter data, and the code I am using generates a JSONL file with a bunch of metadata.
Now I would like to convert that file into a CSV, and I found code that does just that. Unfortunately, I have run into this error:
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 7833)
I have read through other threads and I reckon it has something to do with json.loads
not being able to handle more than one JSON object per file (which is of course the case for my JSON Lines file).
How can I circumvent this problem within the code? Or do I have to use a completely different approach to convert the file? (I am using Python 3.6, and the tweets I am streaming are mostly in Arabic.)
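For context, each line of the .jsonl file is a complete tweet object of its own, roughly like this (values are only placeholders):

{"created_at": "...", "text": "...", "user": {"screen_name": "...", "followers_count": 123, "friends_count": 45, ...}, "retweet_count": 0, "favorite_count": 0, ...}
{"created_at": "...", "text": "...", "user": {...}, ...}

Here is the code I am using: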
__author__ = 'seandolinar'
import json
import csv
import io
'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''
data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)
csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')
for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

csv_out.close()
Upvotes: 1
Views: 2693
Reputation: 55599
If the data file consists of multiple lines, each of which is a single JSON object, you can use a generator to decode the lines one at a time.
def extract_json(fileobj):
    # Using "with" ensures that fileobj is closed when we finish reading it.
    with fileobj:
        for line in fileobj:
            yield json.loads(line)
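For example, a quick way to check the generator before writing any CSV (assuming json and io are imported as in your script; this just prints each tweet's text):

data_python = extract_json(io.open('stream_____.jsonl', mode='r', encoding='utf-8'))
for tweet in data_python:
    print(tweet.get('text'))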
The only changes to your code are that the data_json file is not read explicitly, and data_python is the result of calling extract_json rather than json.loads. Since you are running Python 3.6, the Python 2 unicode() calls are also replaced with str(). Here's the amended code:
import json
import csv
import io

'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''

def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines. Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            yield json.loads(line)

data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8') #opens the JSONL file
data_python = extract_json(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')

for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           str(line.get('user').get('followers_count')),
           str(line.get('user').get('friends_count')),
           str(line.get('retweet_count')),
           str(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

csv_out.close()
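As a side note: you import csv but never actually use it. If you let csv.writer do the quoting, you don't need to double the quotes in the text field by hand. A minimal sketch of that variant (same fields and file names, Python 3; just an alternative, not required for the fix):

import json
import csv
import io

def extract_json(fileobj):
    # One decoded JSON object per line, file closed when exhausted.
    with fileobj:
        for line in fileobj:
            yield json.loads(line)

with io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8', newline='') as csv_out:
    writer = csv.writer(csv_out)  # handles quoting/escaping automatically
    writer.writerow(['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav'])
    for tweet in extract_json(io.open('stream_____.jsonl', mode='r', encoding='utf-8')):
        user = tweet.get('user', {})
        writer.writerow([tweet.get('created_at'),
                         tweet.get('text'),
                         user.get('screen_name'),
                         user.get('followers_count'),
                         user.get('friends_count'),
                         tweet.get('retweet_count'),
                         tweet.get('favorite_count')])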
Upvotes: 1