CookieData

Reputation: 63

How to scrape a website with content-encoding using Python?

I am trying to scrape an online news website.

import requests

st_url = "https://www.straitstimes.com/"
page = requests.get(st_url)

# Output: 
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

I am still new to web scraping, and I am not sure whether this means the website is blocking me from scraping or whether I am just doing it wrong.

Besides Requests, I have tried looking for an XML API link in Chrome Dev Tools but was unable to find one.

Would appreciate some help here. Thank you.

Upvotes: 0

Views: 507

Answers (1)

larsks

Reputation: 311526

If you turn on debug logging...

import logging
logging.basicConfig(level='DEBUG')

...you'll see that you're getting a 403 response from the website:

>>> import logging
>>> import requests
>>> logging.basicConfig(level='DEBUG')
>>> st_url = "https://www.straitstimes.com/"
>>> page = requests.get(st_url)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 403 345

It looks as if the site may be rejecting the default user agent that requests sends. I tried making the same request with curl from the command line and it worked fine.
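To see which user agent requests sends when you don't set one, you can ask the library directly. This is a small sketch; the variable names are only illustrative, and the exact version suffix depends on the requests release you have installed.

```python
import requests

# The User-Agent string requests uses by default, e.g. "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A new Session starts with the same value in its default headers.
session_ua = requests.Session().headers["User-Agent"]
print(session_ua)
```

A site that filters on user agent can easily recognise and reject the "python-requests/" prefix.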

If I use a Firefox user-agent string instead and make the request, it seems to work:

>>> page = requests.get(st_url, headers={'user-agent': 'Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0'})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.straitstimes.com:443
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET / HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://www.straitstimes.com:443 "GET /global HTTP/1.1" 200 51378

This time the final response is a 200 rather than a 403, so the request was successful.
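If you plan to make more than one request, it may be easier to set the header once on a Session. The following is a rough sketch, not the only way to do it: `build_session`, `fetch`, `BROWSER_UA`, and the 10-second timeout are names and values I made up for illustration, and the site may of course change what it accepts.

```python
import requests

# Assumption: any browser-like User-Agent string; this one is copied from the request above.
BROWSER_UA = "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"


def build_session(user_agent=BROWSER_UA):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(url, session, timeout=10):
    """GET the URL and raise requests.HTTPError on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response


if __name__ == "__main__":
    page = fetch("https://www.straitstimes.com/", build_session())
    print(page.status_code, len(page.text))
```

Calling `raise_for_status()` surfaces a 403 as an explicit exception, so a blocked request is easier to tell apart from a decoding problem.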

Upvotes: 1
