Reputation: 63
I am trying to scrape an online news website.
import requests
st_url = "https://www.straitstimes.com/"
page = requests.get(st_url)
# Output:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
I am still new to web scraping, and I am not sure whether this means the website is blocking me from scraping or whether I am just doing it wrong.
Besides Requests, I have tried looking for an XML API link in Chrome Dev Tools but was unable to find one.
Would appreciate some help here. Thank you.
Upvotes: 0
Views: 507
Reputation: 311526
If you turn on debug logging...
import logging
logging.basicConfig(level='DEBUG')
...you'll see that you're getting a 403 response from the website:
>>> import logging
>>> import requests
>>> logging.basicConfig(level='DEBUG')
>>> st_url = "https://www.straitstimes.com/"
>>> page = requests.get(st_url)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 403 345
It looks as if the site may be rejecting whatever requests sends as the default user agent. I tried making the same request with curl from the command line and it worked fine.
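To check what requests identifies itself as by default, you can print its default user-agent string. This is only a quick sketch for inspection; the exact version suffix depends on the installed release, and whether the site actually filters on this value is an assumption based on the behaviour seen here.

```python
import requests

# The user-agent string requests sends when no header is supplied.
# It has the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A fresh Session carries the same value in its default headers.
session = requests.Session()
print(session.headers["User-Agent"])
```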
If I pass a Firefox user-agent string with the request, it seems to work:
>>> page = requests.get(st_url, headers={'user-agent': 'Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0'})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 200 51378
This time the final response is a 200 rather than a 403, so the request was successful.
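If you are going to make several requests, one way to package this is a session that sends the browser-style header every time and raises on error statuses, so a 403 shows up as a clear exception. This is a minimal sketch, not the only way to do it: the helper names make_session and fetch are made up for illustration, and the user-agent string is simply the one used above.

```python
import requests

# Assumption: the same Firefox user-agent string as in the request above.
BROWSER_UA = "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"


def make_session(user_agent=BROWSER_UA):
    """Return a requests.Session that sends user_agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(url, session=None, timeout=10):
    """GET url and raise requests.HTTPError on a 4xx or 5xx status."""
    session = session or make_session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response


# Usage (performs a real network request):
# page = fetch("https://www.straitstimes.com/")
# print(page.status_code, len(page.text))
```

Calling raise_for_status() means a rejected request fails loudly at the point of the call instead of leaving you with an error page to parse.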
Upvotes: 1