Web scraping with Python 3.6 and BeautifulSoup: getting "Invalid URL"

Question

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I'm getting the following output:


Invalid URL

Invalid URL
The requested URL "[no URL]", is invalid.
Reference #9.8f4f1502.1494363829.5fae0e0e

I can open the page in my browser on the same machine without any error message. When I use the same code with another URL, the correct HTML content is fetched:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I also tested other URLs (Reddit, Google, e-commerce sites) and didn't encounter any issues. So the same code works with one URL but not with another. What is the problem?

MD. Khairul Basar · Accepted Answer

This website blocks requests that don't appear to come from a browser, which is why you get the Invalid URL error. Adding a custom User-Agent header to the request works fine:

import requests
from bs4 import BeautifulSoup

ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
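
To see what the site receives in each case, you can inspect the User-Agent header that requests would send, without making any network call. This is only a sketch: the helper name prepared_user_agent is made up for illustration, and the idea that the site filters on this header is an assumption based on the behaviour described above, not something the site documents.

```python
import requests

URL = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"


def prepared_user_agent(headers=None):
    """Return the User-Agent that requests would send for URL.

    Hypothetical helper for illustration. It only prepares the request
    and never sends it, so nothing goes over the network.
    """
    session = requests.Session()
    request = requests.Request("GET", URL, headers=headers)
    # prepare_request merges the session's default headers with ours
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


# Default: requests identifies itself as "python-requests/<version>"
default_ua = prepared_user_agent()

# With the custom header from the answer above
custom_ua = prepared_user_agent({"User-Agent": "Mozilla/5.0"})

print(default_ua)
print(custom_ua)
```

The default value starts with python-requests/, which makes the client easy to tell apart from a browser. Overriding it is presumably what lets the request through here.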
