Reputation: 879
I am using urllib2 in Python to scrape a webpage. However, the read() method does not return.
Here is the code I am using:
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=headers)
f_webpage = urllib2.urlopen(request)
html = f_webpage.read() # <- does not return
I last ran the script a month ago and it was working fine then.
Note that the same script works for other category pages on Edmonton Craigslist, such as http://edmonton.en.craigslist.ca/act/ or http://edmonton.en.craigslist.ca/eve/.
Upvotes: 0
Views: 560
Reputation: 1
I ran into a similar problem. Part of my traceback:
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
File "C:\Python27\lib\httplib.py", line 573, in read
s = self.fp.read(amt)
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
error: [Errno 10054]
I solved it by reading the response in small chunks instead of reading it all at once.
def readBuf(fsrc, length=16*1024):
    # Read up to `length` bytes at a time until the stream is exhausted.
    result = ''
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        else:
            result += buf
    return result
Instead of using html=f_webpage.read(), you can use html=readBuf(f_webpage) to scrape the webpage.
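For illustration, here is a variant of the same chunked-read idea that collects the chunks in a list and joins them once at the end, which avoids repeated string concatenation on large pages. This is only a sketch: read_in_chunks and fake_response are hypothetical names, and an in-memory io.BytesIO stands in for the object returned by urlopen() so the example runs offline. It changes how the data is read, not why the server might reset the connection.

```python
import io


def read_in_chunks(fsrc, chunk_size=16 * 1024):
    """Read a file-like object to exhaustion, one chunk at a time."""
    chunks = []
    while True:
        buf = fsrc.read(chunk_size)
        if not buf:  # an empty result means end of stream
            break
        chunks.append(buf)
    # Join once at the end instead of concatenating inside the loop.
    return b"".join(chunks)


# Stand-in for the urlopen() response (assumption: any object with .read(n) works).
fake_response = io.BytesIO(b"x" * 40000)
html = read_in_chunks(fake_response)
print(len(html))
```

With the code from the question, the call would presumably be html = read_in_chunks(f_webpage).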
Upvotes: 0
Reputation: 5220
As requested in comments :)
Install requests with $ pip install requests
Use requests as follows:
>>> import requests
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = requests.get(url, headers=headers)
>>> request.ok
True
>>> request.text # content in string, similar to .read() in question
...
...
Disclaimer: this does not technically answer the OP's question, but it solves the OP's problem. urllib2 can be awkward to work with, and the requests library was designed to make this kind of task simpler.
Upvotes: 1
Reputation: 527063
It returns (or more specifically, errors out) fine for me:
>>> import urllib2
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = urllib2.Request(url, headers=headers)
>>> f_webpage = urllib2.urlopen(request)
>>> html = f_webpage.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.7/httplib.py", line 541, in read
return self._read_chunked(amt)
File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
Chances are that Craigslist is detecting that you are a scraper and refusing to give you the actual page.
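If the reset is intermittent rather than a deliberate block, one option is to catch the socket error and retry a few times with a short pause. The following is a minimal sketch under that assumption; read_with_retry, fetch, attempts and delay are hypothetical names, not part of urllib2. If the site really is rejecting scrapers, retrying will not help.

```python
import errno
import socket
import time


def read_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that opens the URL and returns the body.
    Only connection resets are retried; every other error propagates.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except socket.error as exc:
            # Errno 104 is ECONNRESET on Linux; Windows reports 10054.
            is_reset = exc.errno in (errno.ECONNRESET, 10054)
            if not is_reset or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time
```

With the code above, fetch could be something like lambda: urllib2.urlopen(request, timeout=10).read(), where the timeout argument also keeps a stalled read() from blocking forever.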
Upvotes: 0