Ben

Reputation: 7124

Unable to GET entire page with Python request

I'm trying to get a long JSON response (~75 MB) from a webpage. However, I can only receive the first 25 MB or so.

I've used urllib2 and python-requests, but neither works. I've also tried reading the response in separate parts and streaming the data, but that doesn't work either.

An example of the data can be found here:

http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W

My code is as follows:

import requests

r = requests.get("http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W")

usgs_data = r.json() # script breaks here

# Save Longitude and Latitude of river
latitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["latitude"]
longitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["longitude"]

# List of all past river flow readings in cubic feet per second
river_history = usgs_data['value']['timeSeries'][0]['values'][0]['value']

It breaks with:

ValueError: Expecting object: line 1 column 13466329 (char 13466328)

The error is raised when the script tries to decode the JSON (i.e. usgs_data = r.json()).

This is because the full data hasn't been received, so the partial response is not a valid JSON object.

Upvotes: 1

Views: 1603

Answers (1)

mhawke

Reputation: 87064

The problem seems to be that the server won't serve more than 13MB of data at a time.

I have tried that URL using a number of HTTP clients including curl and wget, and all of them bomb out at about 13MB. I have also tried enabling gzip compression (as should you), but the results were still truncated at 13MB after decompression.

You are requesting too much data because the period=P260W specifies 260 weeks. If you try setting period=P52W you should find that you are able to retrieve a valid JSON response.
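A quick way to tell a truncated response from a complete one is to try decoding the body and treat a decode failure as "incomplete". The following is only a minimal sketch: the helper name parse_if_complete and the sample strings are made up for illustration, and in practice you would pass it r.text from your own request.

```python
import json


def parse_if_complete(body):
    """Return the decoded JSON object, or None if the body does not parse."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Hypothetical sample data standing in for a real response body.
complete = '{"value": {"timeSeries": []}}'
truncated = complete[:15]  # simulate a response cut off mid-document

print(parse_if_complete(complete) is not None)
print(parse_if_complete(truncated) is not None)
```

If the helper returns None for the period you asked for, retry with a shorter period such as P52W.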

To reduce the amount of data transferred, set the Accept-Encoding header like this:

import requests

url = 'http://waterservices.usgs.gov/nwis/iv/'
params = {'site': 11527000, 'format': 'json', 'parameterCd': '00060', 'period': 'P52W'}
r = requests.get(url, params=params, headers={'Accept-Encoding': 'gzip,deflate'})
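If you still need the full 260 weeks, one option is to split the range into shorter windows and request each one separately. The sketch below only builds the query parameters. It assumes the service accepts startDT and endDT date parameters in place of period (check the USGS documentation before relying on this), and the helper names and the example dates are hypothetical. Whether a 52-week window always stays under the server's limit is also an assumption based on the P52W test above.

```python
from datetime import date, timedelta


def weekly_windows(start, end, weeks_per_request=52):
    """Yield (window_start, window_end) date pairs covering start..end inclusive."""
    step = timedelta(weeks=weeks_per_request)
    current = start
    while current <= end:
        # Each window ends the day before the next one starts, capped at `end`.
        window_end = min(current + step - timedelta(days=1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def build_params(site, start, end):
    """Build one params dict per window (startDT/endDT are assumed parameter names)."""
    return [
        {'site': site, 'format': 'json', 'parameterCd': '00060',
         'startDT': s.isoformat(), 'endDT': e.isoformat()}
        for s, e in weekly_windows(start, end)
    ]


# Example dates chosen only for illustration.
params_list = build_params('14377100', date(2010, 1, 1), date(2014, 12, 31))
```

Each dict in params_list can then be passed as params= to requests.get, and the per-window value lists concatenated once every response has decoded successfully.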

Upvotes: 3
