Mulloy

Reputation: 176

Opening a large JSON file and converting it to CSV

I'm trying to convert a large JSON file (4.35 GB) to CSV.

My initial approach was importing it, converting it to a data frame (I only need what's in features), doing some data manipulation, and exporting it to CSV.

import json
import pandas as pd

with open('Risk_of_Flooding_from_Rivers_and_Sea.json') as data_file:
    d = json.load(data_file)

# Grabbing the data in 'features'; json_normalize already returns a DataFrame.
df = pd.json_normalize(d, 'features')

I've done this successfully with small samples of the dataset, but I can't import the whole file at once, even after leaving it running for 9 hours. My PC has 16 GB of RAM and no error is raised, but I'm assuming it's a memory issue.

Here's a small sample of the JSON data I'm using:

{
    "type": "FeatureCollection",
    "crs": {
        "type": "name",
        "properties": {
            "name": "EPSG:27700"
        }
    },
    "features": [
        {
            "type": "Feature",
            "id": 1,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289344.50009999985,
                            60397.26009999961
                        ],
                        [
                            289347.2400000002,
                            60400
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 1,
                "prob_4band": "Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 112.16436096255808,
                "shape_Area": 353.4856092588217
            }
        },
        {
            "type": "Feature",
            "id": 2,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289250,
                            60550
                        ],
                        [
                            289200,
                            60550
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 2,
                "prob_4band": "Very Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 985.6295076665662,
                "shape_Area": 18755.1377842949
            }
        },

I've looked into splitting the JSON file into smaller chunks, but my attempts have failed. The code below raises this error:

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1).

import json
import os

with open(os.path.join('E:/Jupyter', 'Risk_of_Flooding_from_Rivers_and_Sea.json'), 'r',
          encoding='utf-8') as f1:
    ll = [json.loads(line.strip()) for line in f1.readlines()]
    
    print(len(ll))
          
    size_of_the_split = 10000
    total = len(ll) // size_of_the_split
          
    print(total+1)
          
    for i in range(total+1):
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
            "E:/Jupyter/split" + str(i+1) + ".json", 'w',
            encoding='utf-8'), ensure_ascii=False, indent=True)

I'm just wondering what my options are. Is the way I'm doing it the best way to do this? If it is, what can I change?

I get the smaller samples from this source, but they can't be too large.

Upvotes: 2

Views: 3226

Answers (1)

user15398259

To split the data, you can use a streaming parser such as ijson, which reads the features one at a time instead of loading the whole file into memory. For example:

import ijson
import itertools
import json

chunk_size = 10_000

filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

with open(filename, mode='rb') as file_in:
    features = ijson.items(file_in, 'features.item', use_float=True)
    chunk = list(itertools.islice(features, chunk_size))
    count = 1
    while chunk:
        with open(f'features-split-{count}.json', mode='w') as file_out:
            json.dump(chunk, file_out, ensure_ascii=False, indent=4)
        chunk = list(itertools.islice(features, chunk_size))
        count += 1
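
Since the end goal is a CSV, the split files may not be needed at all: each feature could be written out as a row as soon as it is parsed. The following is only a minimal sketch of that idea, not a tested solution for the full dataset. The column list is taken from the sample in the question, the helper names feature_to_row and write_features_csv are made up for illustration, and the geometry is assumed to be unwanted in the CSV. It uses only the standard library and runs against the two sample features in memory.

```python
import csv
import io

# Columns taken from the sample in the question; adjust to the real data.
FIELDS = ["id", "OBJECTID", "prob_4band", "suitability",
          "pub_date", "shape_Length", "shape_Area"]


def feature_to_row(feature):
    """Flatten one feature into a flat dict (geometry is dropped by assumption)."""
    row = {"id": feature.get("id")}
    row.update(feature.get("properties") or {})
    return row


def write_features_csv(features, file_out):
    """Write any iterable of features to file_out, one row at a time."""
    writer = csv.DictWriter(file_out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for feature in features:
        writer.writerow(feature_to_row(feature))
        count += 1
    return count


# In-memory demo with the two features shown in the question.
sample = [
    {"type": "Feature", "id": 1,
     "geometry": {"type": "Polygon", "coordinates": [[[289344.5, 60397.26]]]},
     "properties": {"OBJECTID": 1, "prob_4band": "Low",
                    "suitability": "National to County",
                    "pub_date": 1522195200000,
                    "shape_Length": 112.16436096255808,
                    "shape_Area": 353.4856092588217}},
    {"type": "Feature", "id": 2,
     "geometry": {"type": "Polygon", "coordinates": [[[289250, 60550]]]},
     "properties": {"OBJECTID": 2, "prob_4band": "Very Low",
                    "suitability": "National to County",
                    "pub_date": 1522195200000,
                    "shape_Length": 985.6295076665662,
                    "shape_Area": 18755.1377842949}},
]

buffer = io.StringIO()
rows_written = write_features_csv(sample, buffer)
print(buffer.getvalue())
```

For the real file, the features argument would be the same ijson.items(file_in, 'features.item', use_float=True) iterator used above, and file_out a file opened with mode='w' and newline=''. Because only one feature is held at a time, memory use should stay roughly constant regardless of file size.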

Upvotes: 2
