Reputation: 1684
I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27
This is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I'm getting the following output:
<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>
I can open the page with my browser from the same machine and don't get any error message. When I use the same code with another URL the correct HTML content is fetched:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I also tested other URLs (reddit, google, ecommerce sites) and didn't encounter any issues. So the same code works with one URL but not with another. What is the problem?
Upvotes: 3
Views: 2962
Reputation: 562
Change your code to:
soup = BeautifulSoup(page.text, "lxml")
page.content is a byte array, so converting it to a string yourself would also help, but you should go with page.text, which already gives you a string.
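To illustrate the bytes-versus-string distinction, here is a minimal sketch in plain Python that needs no network access. The `raw` value is a made-up stand-in for what `page.content` would hold, and UTF-8 is an assumed encoding; requests picks the encoding for `page.text` itself.

```python
# Hypothetical stand-in for page.content: the raw bytes of a response body.
raw = "<title>café</title>".encode("utf-8")

# Roughly what page.text gives you: the same body decoded to a str.
# UTF-8 is assumed here purely for the example.
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```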
Upvotes: 3
Reputation: 5110
This website blocks requests that do not appear to come from a browser, which is why you get the Invalid URL
error. Adding a custom User-Agent header to the request works fine:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
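To see what changes when the header is added, the following sketch builds the request without sending it and prints the User-Agent that requests would use by default next to the overridden one. The helper name `effective_headers` is made up for this example. It only shows what is sent and says nothing about how the site decides what to block.

```python
import requests

URL = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"


def effective_headers(url, extra_headers=None):
    """Return the headers requests would send for a GET, without sending it.

    Hypothetical helper for illustration only.
    """
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers)
    # prepare_request merges the session's default headers with ours.
    prepared = session.prepare_request(request)
    return prepared.headers


default = effective_headers(URL)
custom = effective_headers(URL, {"User-Agent": "Mozilla/5.0"})

print(default["User-Agent"])  # python-requests/<installed version>
print(custom["User-Agent"])   # Mozilla/5.0
```

Without the override, the request identifies itself as python-requests, which is consistent with the explanation above that the site rejects clients that do not look like a browser.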
Upvotes: 2