Reputation: 435
I want to scrape data from a website, but I keep getting HTTP Error 405: Not Allowed. What am I doing wrong?
(I have looked at the documentation, and tried their code, with only my url in place of the example's; I still have the same error.)
Here's the code:
import urllib.request
list_url = ["http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm"]
for url in list_url:
    # Send a browser-like User-Agent header with the request
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req).read()
If I skip the user-agent term, I get HTTP Error 403: Forbidden.
In the past, I have successfully scraped data (from another website) using the following:
from bs4 import BeautifulSoup
for url in list_url:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html, "lxml")
Ideally, I would like to keep a similar structure, that is, pass the content of the fetched url to BeautifulSoup. Thanks!
Upvotes: 0
Views: 4768
Reputation: 55
The response you are getting is a page that says "Pardon our Interruption. something about your browser made us think you were a bot". That implies the site does not permit scraping and runs bot detection on its pages.
Try imitating a real browser by sending a full browser User-Agent header (see "How to use Python requests to fake a browser visit?"):
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm'
web_page = requests.get(url, headers=headers)
I tried this and found that the page content is loaded via JavaScript, so you will probably need a headless browser (Selenium / PhantomJS) to scrape the rendered HTML. Hope it helps.
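To see for yourself what the server sends back with the 403/405, you can read the body of the error response instead of letting the exception propagate. This is only a minimal sketch using the standard library: `fetch` and `looks_like_bot_block` are hypothetical helper names, and matching on the "Pardon our Interruption" text is an assumption based on the message quoted above, not a documented signal.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Return (status, body_text); HTTP error responses are returned, not raised."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the error page is readable.
        return err.code, err.read().decode("utf-8", errors="replace")


def looks_like_bot_block(status, body):
    """Heuristic (an assumption): a 403/405 status plus the interstitial text."""
    return status in (403, 405) and "pardon our interruption" in body.lower()
```

Calling `status, body = fetch(url)` and then `looks_like_bot_block(status, body)` tells you whether you hit the interstitial page or a genuine error; `body` can be passed to BeautifulSoup either way.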
Upvotes: 1
Reputation: 47364
I'm not sure of the exact reason for the issue, but try this code; it works for me:
import http.client
connection = http.client.HTTPSConnection("www.glassdoor.com")
connection.request("GET", "/Reviews/WhiteWave-Reviews-E9768.htm")
res = connection.getresponse()
data = res.read()
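Note that the snippet above connects over HTTPS and sends no User-Agent header, unlike the code in the question. Since `http.client` does not raise on error statuses, it may be worth checking `res.status` and the page title before parsing `data`. The following is a rough sketch along those lines: `get_page`, `page_title` and `TitleParser` are hypothetical names, and the standard library's `html.parser` is used here only as a stand-in for BeautifulSoup.

```python
import http.client
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html_text):
    """Return the stripped <title> text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def get_page(host, path, user_agent="Mozilla/5.0"):
    """Return (status, text) for an HTTPS GET; redirects are not followed."""
    connection = http.client.HTTPSConnection(host, timeout=10)
    try:
        connection.request("GET", path, headers={"User-Agent": user_agent})
        res = connection.getresponse()
        return res.status, res.read().decode("utf-8", errors="replace")
    finally:
        connection.close()
```

For example, `status, text = get_page("www.glassdoor.com", "/Reviews/WhiteWave-Reviews-E9768.htm")` followed by `page_title(text)` shows whether you received the review page or a block page. If the status is 200, the same `text` can be handed to `BeautifulSoup(text, "lxml")` as in the question.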
Upvotes: 0