Why am I unable to receive data from this website?

Question

I'm trying to write a program that parses the HTML of a particular website, but requests fails with a bad status line error for the one site I want to use. The same code has worked fine for every other website I've tried. Is the site doing this intentionally, and is there anything I can do about it?

My code:

from lxml import html
import requests

webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage)
tree = html.fromstring(page.text)

The error message I receive:

Traceback (most recent call last):
  File "/home/kyle/Documents/web.py", line 6, in <module>
    page = requests.get(webpage)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))

alecxe · Accepted Answer

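The empty BadStatusLine means the server closed the connection without sending any response, apparently because it rejects the default python-requests User-Agent. Provide a browser-like User-Agent header and the request goes through: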

webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage, 
                    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})

Proof:

>>> from lxml import html
>>> import requests
>>> 
>>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
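
If you plan to make several requests to the site, the header can be set once on a requests.Session instead of being repeated on every call. The following is a minimal sketch, not a definitive implementation: make_session and fetch are hypothetical helper names, the 10-second timeout is an arbitrary choice, and it assumes the site keeps accepting the browser User-Agent string shown above.

```python
import requests

# Browser User-Agent string copied from the answer above. Assumption: the
# site accepts any common browser string, not only this exact one.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/39.0.2171.95 Safari/537.36')


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends user_agent with every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch(session, url):
    """Return the response for url, or None if the connection fails.

    A dropped connection (such as the BadStatusLine case in the question)
    surfaces as requests.exceptions.ConnectionError.
    """
    try:
        return session.get(url, timeout=10)
    except requests.exceptions.ConnectionError:
        return None
```

Calling fetch(make_session(), webpage) then returns a response whose .content can be passed to html.fromstring as in the session above, or None if the server drops the connection.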

FYI, it also works without the header if you switch to https:

>>> webpage = 'https://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage)
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
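
If the URLs come from elsewhere (a list or scraped links, say), the scheme can be rewritten in code rather than by hand. This is a rough sketch using the Python 3 standard library's urllib.parse (on Python 2 the module is named urlparse); force_https is a hypothetical helper name, and whether a given server actually answers over https is an assumption you would need to check.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with an http scheme replaced by https.

    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)
```

The result of force_https(webpage) can be passed straight to requests.get.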
