Reputation: 1540
I have a fairly large Python 3 import script, part of which fetches a URL and parses the body.
The code looks like this:
import requests
url = 'http://...'  # some URL here which returns an HTML page when fetched with curl
req = requests.get(url)
print("--- status_code %s" % req.status_code)
print("--- body length %s" % len(req.text))
I am getting:
--- status_code 200
--- body length 0
Looking at the headers I see this:
{'Keep-Alive': 'timeout=5, max=100', 'Content-Length': '0', 'Date': 'Mon, 06 Nov 2017 03:14:49 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=utf-8'}
I've searched everywhere for why the content length is 0, but I haven't been able to figure it out.
To test this in isolation, I created a small script that fetches the same URL with the same snippet. That test script works fine!
Why does one script work but not the other? I've read that requests.get() blocks by default, so it should behave the same in both cases. Is there anything I'm missing?
Upvotes: 0
Views: 2345
Reputation: 1540
I figured this out, and the issue was my own mistake: the URL I was trying to fetch contained a "\n" character in the query string, which caused the page to throw an error. Thanks Klaus for reminding me to check the server.
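In case anyone else runs into this, here is a minimal sketch of the kind of fix that worked for me (the URL, parameter name, and value below are placeholders, not the real ones): strip the stray newline and let requests build and encode the query string.

import requests

# Placeholder URL and parameter; the real ones come from my import script.
base_url = 'http://example.com/page'
raw_value = 'some value\n'          # the stray newline that broke the request

params = {'q': raw_value.strip()}   # strip whitespace/newlines before sending
req = requests.get(base_url, params=params)

print(req.url)                      # requests percent-encodes the query for you
print(req.status_code)
print(len(req.text))

Passing params instead of concatenating the query by hand also avoids this class of problem in the first place.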
Upvotes: 0
Reputation: 846
How many times and how often do you try to access the server from the main script and from the snippet?
If you try to parse some external site, it may become "angry" and return zero-sized content. It is quite a common measure to prevent site scraping. In this scenario, your test script would work just fine as long as it is executed only once or twice. Your main script, however, after a certain number of requests (five, or ten, or ten per second) would be restricted by the site for some amount of time.
If that's the case, you can try inserting a delay between requests in your script, along the lines of the sketch below.
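A minimal sketch, assuming a placeholder URL and an arbitrary two-second delay (tune it to whatever the site tolerates):

import time
import requests

url = 'http://example.com/page'     # placeholder; use the page you are actually fetching

for i in range(5):
    req = requests.get(url)
    print(i, req.status_code, len(req.text))
    time.sleep(2)                   # pause between requests so the site is not hammered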
Upvotes: 1