Kunal Sharma

Reputation: 107

Split 30 GB JSON file into smaller files

I am running into memory issues when reading a JSON file that is 30 GB in size. Is there a direct way in Python 3.x, as there is in Unix, to split the JSON file into smaller files based on lines?

E.g. the first 100000 records go into the first split file, and the rest go into subsequent child JSON files?

Upvotes: 0

Views: 524

Answers (1)

The Fool

Reputation: 20467

How hard this is depends on your input data and on whether its structure is known and consistent.

In my example here, the idea is to read the file lazily, line by line, and write a new file whenever a valid object can be constructed from the input. It's a bit like manual parsing.

In a real-world case, the logic for when to write a new file will depend heavily on your input and on what you are trying to achieve.

Some sample data

[
    {
        "color": "red",
        "value": "#f00"
    },
    {
        "color": "green",
        "value": "#0f0"
    },
    {
        "color": "blue",
        "value": "#00f"
    },
    {
        "color": "cyan",
        "value": "#0ff"
    },
    {
        "color": "magenta",
        "value": "#f0f"
    },
    {
        "color": "yellow",
        "value": "#ff0"
    },
    {
        "color": "black",
        "value": "#000"
    }
]

# o accumulates the lines of the current object
# before it is written to a file
o = ''

# itemCount counts the number of valid json objects found so far
itemCount = 0

# iterate over the file object to read it lazily, line by line,
# which avoids memory issues; the with block also closes the file
with open('data.json') as infile:
  for i, line in enumerate(infile):
    # ignore the opening square bracket on the first line
    if i == 0:
      continue
    # in this data every object spans 4 lines, so every 4th line closes one
    # this logic depends on your input data
    if i % 4 == 0:
      itemCount += 1
      # at this point I am able to create a valid json object
      # based on my knowledge of the input file structure
      # (the trailing comma after the closing brace is dropped)
      validObject = o + line.replace("},\n", '}\n')
      o = ''
      # now write each object to its own file
      with open(f'item-{itemCount}.json', 'w') as outfile:
        outfile.write(validObject)
    else:
      o += line
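
The loop above writes one file per object, while the question asks for a fixed number of records per file. Here is a minimal sketch of that variation. It assumes the same layout as the sample data: the square brackets sit on their own lines, and every object is flat and ends on a line containing only } or },. The function name split_json_array, the part-N.json file naming and the records_per_file parameter are made up for illustration.

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, records_per_file=100_000):
    """Split a pretty-printed JSON array into files of N records each.

    Assumption: "[" and "]" are on their own lines and every (flat) object
    ends on a line that contains only "}" or "},".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch = []    # parsed objects waiting to be written
    buffer = []   # lines of the object currently being read
    written = []  # paths of the files created so far

    def flush():
        path = out_dir / f"part-{len(written) + 1}.json"
        with open(path, "w") as outfile:
            json.dump(batch, outfile)
        written.append(path)
        batch.clear()

    with open(src) as infile:
        for line in infile:  # lazy, one line at a time
            stripped = line.strip()
            if stripped in ("[", "]", ""):
                continue
            buffer.append(stripped)
            if stripped in ("}", "},"):
                # drop the trailing comma, then parse the single object
                batch.append(json.loads("".join(buffer).rstrip(",")))
                buffer.clear()
                if len(batch) == records_per_file:
                    flush()
    if batch:  # write the remaining records
        flush()
    return written
```

Each output file is itself a valid JSON array, so it can be loaded with json.load on its own. Only one batch is held in memory at a time, so records_per_file should be chosen with that in mind. Objects that contain nested objects would break the "line with only }" assumption and need a different end-of-object test.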


Here is a repl with the working example: https://replit.com/@bluebrown/linebyline
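
If the line layout is not known or not consistent, counting lines stops working. One possible alternative, sketched here with only the standard library, is to feed the file in chunks to json.JSONDecoder.raw_decode and yield each array item as soon as it is complete. This is a different technique from the line counting above, and the names iter_array_items and chunk_size are hypothetical.

```python
import json


def iter_array_items(fileobj, chunk_size=65536):
    """Yield the items of a top-level JSON array one at a time.

    Roughly one chunk plus the current item is kept in memory.
    """
    decoder = json.JSONDecoder()
    buf = fileobj.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # drop whitespace and the comma that separates two items
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            end = None  # the item is not complete yet
        if end is not None:
            # accept the item only once its delimiter is in the buffer,
            # so a number cut at a chunk boundary is not yielded too early
            rest = buf[end:].lstrip()
            if rest[:1] in (",", "]"):
                yield item
                buf = buf[end:]
                continue
        more = fileobj.read(chunk_size)
        if not more:
            raise ValueError("unexpected end of input inside the array")
        buf += more
```

It can be used as "for item in iter_array_items(open('data.json')): ..." and combined with any batching logic. Note one limitation of this sketch: it cannot tell invalid JSON from incomplete JSON, so on a malformed file it keeps buffering until the end of the input before raising an error.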

Upvotes: 1
