wheikes

Reputation: 11

Converting a very large JSON file to CSV

I have a JSON file that is about 8GB in size. When I try to convert the file using this script:

import csv
import json

infile = open("filename.json","r")
outfile = open("data.csv","w")

writer = csv.writer(outfile)

for row in json.loads(infile.read()):
    writer.writerow(row)

I get this error:

Traceback (most recent call last):
  File "E:/Thesis/DataDownload/PTDataDownload/demo.py", line 9, in <module>
    for row in json.loads(infile.read()):
MemoryError

I'm sure this has to do with the size of the file. Is there a way to ensure the file will convert to a CSV without the error?

This is a sample of my JSON code:

     {"id":"tag:search.twitter.com,2005:905943958144118786","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:899030045234167808","link":"http://www.twitter.com/NAJajsjs3","displayName":"NAJajsjs","postedTime":"2017-08-19T22:07:20.000Z","image":"https://pbs.twimg.com/profile_images/905943685493391360/2ZavxLrD_normal.jpg","summary":null,"links":[{"href":null,"rel":"me"}],"friendsCount":23,"followersCount":1,"listedCount":0,"statusesCount":283,"twitterTimeZone":null,"verified":false,"utcOffset":null,"preferredUsername":"NAJajsjs3","languages":["tr"],"favoritesCount":106},"verb":"post","postedTime":"2017-09-08T00:00:45.000Z","generator":{"displayName":"Twitter for iPhone","link":"http://twitter.com/download/iphone"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","body":"@thugIyfe Beyonce do better","object":{"objectType":"note","id":"object:search.twitter.com,2005:905943958144118786","summary":"@thugIyfe Beyonce do better","link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","postedTime":"2017-09-08T00:00:45.000Z"},"inReplyTo":{"link":"http://twitter.com/thugIyfe/statuses/905942854710775808"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"user_mentions":[{"screen_name":"thugIyfe","name":"dari.","id":40542633,"id_str":"40542633","indices":[0,9]}],"symbols":[],"urls":[]},"twitter_filter_level":"low","twitter_lang":"en","display_text_range":[10,27],"retweetCount":0,"gnip":{"matching_rules":[{"tag":null,"id":6134817834619900217,"id_str":"6134817834619900217"}]}}

(sorry for the ugly formatting)

An alternative: I have about 8000 smaller JSON files that I combined to make this file. They are each in their own folder, with just the single JSON file per folder. Would it be easier to convert each of these individually and then combine them into one CSV?

I am asking because I have very basic Python knowledge, and all the answers to similar questions that I have found are far more complicated than I can understand. Please help this new Python user read this JSON as a CSV!

Upvotes: 1

Views: 7702

Answers (2)

Russell Wong

Reputation: 101

Yes, it can absolutely be done in a very easy way. I opened a 4GB JSON file in a few seconds. In my case I didn't need to convert to CSV, but that part can be done just as easily.

  1. Start MongoDB with Docker (example commands for steps 1 and 3 are shown after this list).
  2. Create a temporary database in MongoDB, e.g. test.
  3. Copy the JSON file into the Docker container.
  4. Run the mongoimport command:

    docker exec -it container_id mongoimport --db test --collection data --file /tmp/data.json --jsonArray

  5. Run the mongoexport command to export to CSV:

    docker exec -it container_id mongoexport --db test --collection data --csv --out data.csv --fields id,objectType
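
For steps 1 and 3, something like the following should work (the container name mongodb and the /tmp/data.json path are only examples; swap in your own names, and use that container's name or id in the mongoimport and mongoexport commands above):

    docker run -d --name mongodb mongo
    docker cp filename.json mongodb:/tmp/data.json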

Upvotes: 0

OneCricketeer

Reputation: 191681

Would it be easier to convert each of these individually and then combine them into one csv?

Yes, it certainly would.

For example, this will put each JSON object/array (whatever is loaded from the file) onto its own line of a single CSV.

import json, csv
from glob import glob

with open('out.csv', 'w') as f:
    for fname in glob("*.json"):  # Reads all json from the current directory
        with open(fname) as j:
            f.write(str(json.load(j)))  # parse the whole file and write its string form as one line
            f.write('\n')

Use the glob pattern **/*.json (with recursive=True) to find all json files in nested folders.

It's not really clear what for row in ... was doing for your data, since you don't have an array. Unless you wanted each JSON key to be a CSV column?
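
If that's what you're after, here's a rough sketch of one way to do it with csv.DictWriter (assuming each file holds a single JSON object; the field names and the out.csv filename are just examples based on your sample data):

import csv
import json
from glob import glob

# Example top-level fields taken from the sample tweet; adjust to whatever columns you need
fields = ["id", "objectType", "verb", "postedTime", "body"]

with open("out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for fname in glob("*.json"):           # one JSON object per file
        with open(fname) as j:
            writer.writerow(json.load(j))  # top-level keys not listed in fields are ignored

Nested objects like actor would still need to be flattened into their own columns if you want them; extrasaction="ignore" simply drops any top-level keys that aren't in the fields list.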

Upvotes: 1
