Web Scraping: Page exists but getting 404 using requests/urllib

Question

I am trying to scrape the following page: http://usbcdirectory.com/listing/1-us-black-chambers

I am using Python 3.5.0

Here is my code:

import urllib.request
urllib.request.urlopen('http://usbcdirectory.com/listing/1-us-black-chambers')

With the above I get a 404 Not Found error. However, the page loads fine when I open it in a browser.

I searched for a solution to this problem, and here is what I have found and tried:

Switch from urllib to requests: I already did this and still got a 404 status code

>>> requests.get('http://usbcdirectory.com/listing/1-us-black-chambers')
<Response [404]>

Check the link: I verified that the URL is correct.
Check whether the page is generated using JavaScript: I believe it is not.

What is the issue here? Is the site blocking scraping in some way, or is it a problem with the URL?

ritiek · Accepted Answer

As you guessed, they are probably blocking your request. You can pass custom headers so that your request looks more like one from a real browser:

import requests

url = 'http://usbcdirectory.com/listing/1-us-black-chambers'
headers = {'Accept': 'text/html'}
response = requests.get(url, headers=headers)
print(response.status_code)
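
Since the question started with urllib, the same idea can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names build_request and fetch_status are made up here, and the User-Agent value is a placeholder I have not tested against this site (only the Accept header comes from the code above).

```python
import urllib.error
import urllib.request

# Assumed header set for illustration; only 'Accept' comes from the answer.
BROWSER_LIKE_HEADERS = {
    "Accept": "text/html",
    "User-Agent": "Mozilla/5.0",  # hypothetical placeholder value
}


def build_request(url, headers=None):
    """Return a urllib Request carrying browser-like headers."""
    merged = dict(BROWSER_LIKE_HEADERS)
    merged.update(headers or {})  # caller-supplied headers take precedence
    return urllib.request.Request(url, headers=merged)


def fetch_status(url, headers=None):
    """Return the HTTP status code, including error codes such as 404."""
    request = build_request(url, headers)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the code is still on the exception.
        return error.code
```

Calling fetch_status('http://usbcdirectory.com/listing/1-us-black-chambers') would then return the status code instead of raising, which makes it easier to compare results with and without the extra headers.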
