Reputation: 3
I am trying to eventually make a program parsing the html of a particular website, but I get a bad status line error for the website I'd like to use. This code has worked fine for any other website I've tried. Is this something they are doing intentionally and there is nothing I can do?
My code:
from lxml import html
import requests
webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage)
tree = html.fromstring(page.text)
The error message I receive:
Traceback (most recent call last):
File "/home/kyle/Documents/web.py", line 6, in <module>
page = requests.get(webpage)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
response = session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 461, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))
Upvotes: 0
Views: 346
Reputation: 474031
Provide a User-Agent
header and it would work for you:
webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage,
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
Proof:
>>> from lxml import html
>>> import requests
>>>
>>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
Search Results for "de la soul" | WhoSampled
FYI, it would also work if you switch to https:
>>> webpage = 'https://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage)
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
Upvotes: 1