Reputation: 83
I'm writing a script using Scrapy, but I'm running into trouble with failed HTTP responses. Specifically, I'm trying to scrape "https://www.crunchbase.com/" but I keep getting HTTP status code 416. Can websites block spiders from scraping their contents?
Upvotes: 6
Views: 4605
Reputation: 4381
If you use a Chrome user agent, you will need to include "br" and "sdch" among the accepted encodings.
Here is a sample:
import requests

# Headers copied from a real Chrome session
html_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br, sdch',
    'Connection': 'keep-alive',
    'Host': 'www.crunchbase.com',
    'Referer': 'https://www.crunchbase.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}
res = requests.get('https://www.crunchbase.com/', headers=html_headers)
As someone else already mentioned: in Chrome, open Developer Tools (the three-dot menu in the upper right corner -> More tools -> Developer tools, or press Ctrl+Shift+I), go to the "Network" tab, reload the page, click the red dot to stop recording, then click on a request; on the right you will see the "Request Headers" section.
EDIT: If you want to use a real web engine, like WebKit, you probably won't need any tricks at all. For example:
import sys

import bs4 as bs
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage


class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until the page has finished loading

    def on_page_load(self):
        self.app.quit()


url = 'https://www.crunchbase.com/'
cont = Client(url).mainFrame().toHtml()
soup = bs.BeautifulSoup(cont, 'lxml')
Another advantage of this approach is that it executes JavaScript, so it gets around dynamic loading. For example, if a JavaScript function called on page load substitutes some text in the page, with this approach you can get the new text.
Upvotes: 1
Reputation: 3765
You are right: http://crunchbase.com blocks bots. It still serves an HTML page, "Pardon Our Interruption", which explains why they think you are a bot and provides a form to request an unblock (even though the status code is 416).
According to a VP of Marketing at Distil Networks, Crunchbase uses Distil Networks' anti-bot protection:
https://www.quora.com/How-does-distil-networks-bot-and-scraper-detection-work
After several attempts, even my browser access was blocked there. I submitted an unblock request and access was restored. I'm not sure about other Distil-protected sites, but you can try asking Crunchbase management nicely.
Upvotes: 1
Reputation: 59651
What's happening is that the website is looking at the headers attached to your request and deciding that you're not a browser and therefore blocking your request.
However, there is nothing the website can do to differentiate between Scrapy and Firefox/Chrome/IE/Safari if you decide to send the same headers as a browser. In Chrome, open the Network tab of Developer Tools and you will see exactly which headers it is sending. Copy these headers into your Scrapy request and everything will work.
You might want to start by sending the same User-Agent header as your browser.
How to send these headers with your Scrapy request is documented here.
Upvotes: 6