Luka

Reputation: 255

Parse large JSON file in Python

I'm trying to parse a really large JSON file in Python. The file has 6523440 lines and is made up of many separate top-level JSON documents, one after another.

The structure looks like this:

[
  {
    "projects": [
     ...
    ]
  }
]
[
  {
    "projects": [
     ...
    ]
  }
]
....
....
....

and it goes on and on...

Every time I try to load it with json.load() I get an error:

ValueError: Extra data: line 2247 column 1 - line 6523440 column 1 (char 101207 - 295464118)

The error points to the line where the first document ends and the second one starts. Is there a way to load the documents separately, or something similar?

Upvotes: 3

Views: 2799

Answers (2)

Dunes

Reputation: 40843

Try using json.JSONDecoder.raw_decode. It still requires you to have the entire document in memory, but allows you to iteratively decode many objects from one string.

import re
import json

document = """
[
    1,
    2,
    3
]
{
    "a": 1,
    "b": 2,
    "c": 3
}
"""

not_whitespace = re.compile(r"\S")

decoder = json.JSONDecoder()

items = []
index = 0
while True:
    # Skip whitespace between documents; stop at the end of the input.
    match = not_whitespace.search(document, index)
    if not match:
        break

    # raw_decode returns the decoded object and the index just past it,
    # so the next iteration resumes from there.
    item, index = decoder.raw_decode(document, match.start())
    items.append(item)

print(items)
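To apply this to the file from the question, read the whole file into a string first; the filename below is a placeholder:

with open("large_file.json") as f:
    document = f.read()

The loop above then appends each top-level array to items in turn.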

Upvotes: 0

shyam

Reputation: 9368

You can try using a streaming JSON library like ijson. From its documentation:

Sometimes, when dealing with a particularly large JSON payload, it may be worth it to not even construct individual Python objects and instead react to individual events immediately, producing some result.
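For example, here is a minimal sketch for the structure in the question; the filename is a placeholder, and it assumes a recent ijson release that supports the multiple_values flag for files containing several top-level documents back to back:

import ijson

# Placeholder filename; opened in binary mode, which ijson expects.
with open("large_file.json", "rb") as f:
    # multiple_values=True keeps parsing after the first top-level
    # document instead of stopping with an "extra data" error.
    for obj in ijson.items(f, "item", multiple_values=True):
        # "item" matches each element of each top-level array, i.e.
        # the objects holding a "projects" key in the question.
        for project in obj["projects"]:
            print(project)  # handle each project as it streams in

This way nothing close to the full 6523440 lines needs to be held in memory at once.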

Upvotes: 2
