Reputation: 107
I'm building a chatbot database at the moment, using data from pushshift.io. To handle the big data file (I understand that json.load reads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that's a no-no), I wrote a bash script that splits the big file into smaller 3GB chunks that I can run through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I saw this in it:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted record looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script split the file without paying attention to JSON object boundaries. So my question is: is there a way to write a function in Python that detects JSON objects that are not correctly formatted and deletes them?
Upvotes: 0
Views: 50
Reputation: 4427
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental JSON parsers available in Python. A quick search shows that ijson should allow you to traverse your very large data structure without exploding memory.
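For example, a minimal sketch with ijson (assuming the file holds one top-level JSON array, and using a hypothetical comments.json filename) could look like this:

```python
import ijson  # pip install ijson

# Stream objects one at a time instead of loading the whole file into RAM.
# The "item" prefix assumes the file is a single top-level JSON array;
# adjust it if your structure differs.
with open("comments.json", "rb") as f:
    for comment in ijson.items(f, "item"):
        # `comment` is an ordinary dict that can be processed and then discarded
        print(comment["body"])
```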
You should also consider another data format (or a real database); otherwise you will easily find yourself spending time reimplementing much slower versions of features that already exist in the right tools.
Upvotes: 1
Reputation: 4767
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap your code in a try/except statement and catch this exception to make sure you only process correctly formatted data.
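A minimal sketch of that idea, assuming the chunks contain one JSON object per line (as the pushshift dumps do) and using a hypothetical chunk filename:

```python
import json

good_records = []

# "chunk_aa" is a hypothetical file produced by the bash split.
with open("chunk_aa", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            good_records.append(json.loads(line))
        except json.JSONDecodeError:
            # Records cut in half by the split fail to parse; skip them.
            continue

print(f"Kept {len(good_records)} valid records")
```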
Upvotes: 0