pceccon

Reputation: 9844

Processing huge json file in Python - ValueError

I have this piece of code to process a big file in Python:

import urllib2, json, csv
import requests

def readJson(url):
    """
    Read and parse JSON from a URL.
    :param url: url to be read.
    :return: parsed JSON data, or None on HTTP error.
    """
    try:
        response = urllib2.urlopen(url)
        return json.loads(response.read(), strict=False)
    except urllib2.HTTPError as e:
        return None

def getRoadsTopology():
    nodes = []
    edges = []

    url = "https://data.cityofnewyork.us/api/geospatial/svwp-sbcd?method=export&format=GeoJSON"
    data = readJson(url)
    print "Done reading road bed"
    print "Processing road bed..."

    v_index = 0
    roads = 0
    for road in data['features']:
        n_index = len(nodes)
        # (long, lat)
        coordinates = road['geometry']['coordinates'][0]
        for i in range(0, len(coordinates)):
            lat_long = coordinates[i]
            nodes.append((lat_long[1], lat_long[0]))

        for i in range(n_index, len(nodes) - 1):
            print i, i+1
            edges.append((i, i+1))
    return nodes, edges

Sometimes it works, but much of the time I get the same kind of error, reported at a different position each time:

File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting : delimiter: line 7 column 4 (char 74317829)



File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 5 column 1 (char 72149996)

I'm wondering what causes these errors, why they occur at different positions, and how I could solve the problem.

The site that provides this file also renders it successfully:

https://data.cityofnewyork.us/City-Government/road/svwp-sbcd

Upvotes: 0

Views: 324

Answers (1)

Ketzak

Reputation: 628

It looks like your JSON input is malformed. The error is thrown from raw_decode, which is part of the JSON library, so the parse is failing before it even reaches your processing code. The inconsistency of the results makes me think the JSON is somehow getting corrupted, or not being completely delivered.
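
One quick way to check the "not completely delivered" theory is to compare how many bytes you actually received against the Content-Length the server reports. This is just a rough sketch, and it assumes the server sends a Content-Length header rather than using chunked transfer encoding:

import urllib2

url = "https://data.cityofnewyork.us/api/geospatial/svwp-sbcd?method=export&format=GeoJSON"
response = urllib2.urlopen(url)
body = response.read()
# May be None if the server doesn't report a length.
expected = response.info().getheader('Content-Length')
print "received %d bytes, Content-Length reports %s" % (len(body), expected)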

My next step would be to pull the JSON from the source, store it in a local file, lint it to make sure it's valid, and then test your program against that file directly.
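
A minimal sketch of that workflow, in the same Python 2 style as your code (the filename roadbed.geojson is just a placeholder):

import json
import urllib2

url = "https://data.cityofnewyork.us/api/geospatial/svwp-sbcd?method=export&format=GeoJSON"
local_path = "roadbed.geojson"

# Pull the JSON once and store it on disk.
response = urllib2.urlopen(url)
with open(local_path, "wb") as f:
    f.write(response.read())

# "Lint" it: json.load raises ValueError at the first syntax problem,
# which tells you whether the stored copy is complete and valid.
with open(local_path, "rb") as f:
    try:
        data = json.load(f)
        print "Valid JSON with %d features" % len(data["features"])
    except ValueError as e:
        print "Invalid or incomplete JSON:", e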

Update:

Curious, I downloaded the file several times. A couple of the downloads came out far too small; the real size seems to be around 121 MB. Once I was getting complete copies consistently, I ran your program against one, replacing your URL loader with a file loader. It works perfectly, unless the machine has too little RAM, in which case it segfaults.
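
For reference, the swap I mean looks roughly like this (a sketch; readJsonFile and the path are just illustrative names):

import json

def readJsonFile(path):
    """Load and parse a JSON file from local disk."""
    with open(path, "rb") as f:
        return json.load(f)

# in getRoadsTopology(), instead of readJson(url):
# data = readJsonFile("roadbed.geojson")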

I had the most success downloading the file on a virtual server on DigitalOcean, which got the whole file every time. On my local machine the file came through truncated, which leads me to believe the server sending you the JSON is cutting off the stream after some timeout period. The DigitalOcean server has massive throughput, averaging 12 MB/s, and pulled the entire file in 10 seconds. My local machine could manage less than 1 MB/s and never finished: the transfer stopped at 2 minutes, having pulled only 75 MB. The sending server probably has a 2-minute time limit on requests.

This would explain why their page works but your script struggles to get it all: the map data is processed by another server that can pull it from the source within the time allowed, and it is then streamed piece by piece to the web viewer as needed.

Upvotes: 3
