Zin Yosrim

Reputation: 1684

Web scraping with Python 3.6 and BeautifulSoup - getting Invalid URL

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I'm getting the following output:

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>

I can open the page in my browser from the same machine and don't get any error message. When I use the same code with another URL, the correct HTML content is fetched:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I also tested other URLs (Reddit, Google, e-commerce sites) and didn't encounter any issues. So the same code works with one URL but not with the other. Where is the problem?

Upvotes: 3

Views: 2962

Answers (2)

Roshni Amber

Reputation: 562

Change your code to:

soup = BeautifulSoup(page.text, "lxml")

If you keep using page.content, converting the byte string to a regular string would also work, but you should simply go with page.text, which requests has already decoded for you.
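As a minimal sketch of the difference between the two attributes (using the URL from the question; the print statements are only for illustration, and without custom headers this site may still return the Invalid URL page):

import requests

url = 'http://www.sothebys.com/en/search-results.html?keyword=degas%27'
page = requests.get(url)

# page.content is the raw response body as bytes
print(type(page.content))  # <class 'bytes'>

# page.text is the same body decoded to a str using the encoding requests detects
print(type(page.text))     # <class 'str'>

# Decoding the bytes by hand gives a str as well
decoded = page.content.decode(page.encoding or page.apparent_encoding)
print(type(decoded))       # <class 'str'>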

Upvotes: 3

MD. Khairul Basar

Reputation: 5110

This website blocks requests that don't appear to come from a browser, which is why you get the Invalid URL error. Adding a custom User-Agent header to the request makes it work fine.

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site doesn't reject the request
ua = {"User-Agent": "Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)

Upvotes: 2
