Reputation: 3812
I was trying to write a bs4 scraper for this URL when I realized that it worked sometimes and not others, seemingly arbitrarily.
So, I made some code here (which you don't have to read all of):
import urllib2
import sys
from bs4 import BeautifulSoup

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Return the 302 response itself instead of following the redirect
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def pullPage():
    url = "http://shop.nordstrom.com/s/tory-burch-caroline-ballerina-flat-women/3152313?origin=category-personalizedsort&contextualcategoryid=0&fashionColor=Camellia+Pink+Beige&resultback=441"
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    try:
        opener = urllib2.build_opener(RedirectHandler())
        webpage = opener.open(req)
        soup = BeautifulSoup(webpage, "html5lib")
        # Return the parsed document serialized back to a string
        return str(soup)
    except Exception as e:
        print str(e)
        if '403' in str(e):
            sys.exit("This scraper is forbidden from this site")
        elif '[Errno -2]' in str(e):
            sys.exit("This program can not connect to the internet")
        sys.exit('Broken URL')

# Fetch the page repeatedly and print the length of the HTML each time
happy = 1
while happy < 10:
    print len(pullPage())
    happy = happy + 1
This program prints the number of characters in the page's HTML on each pass through the loop (nine passes, since the loop runs while happy < 10). Here is the output:
218531
218524
377646
218551
377646
218559
218547
376938
218552
Does anyone know why the HTML returned by this website is sometimes almost twice as long as at other times? Is there some way to wait until the whole page loads?
The code to focus on, I believe, is these lines:
webpage = opener.open(req)
soup = BeautifulSoup(webpage, "html5lib")
Edit 1: Could someone else run this code and let me know if their results are similar?
Edit 2: I reran this code on a separate machine (a Google server) and got similar results:
218565
218564
376937
376487
378243
218564
218557
378248
377791
Upvotes: 0
Views: 29
Reputation: 114559
There could be many reasons:
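Whatever the cause turns out to be, one way to narrow it down is to diff a short response against a long one and look at what the extra text actually is. The following is only a sketch using the standard library's difflib; the function name differing_chunks and the two sample strings are made up for illustration and are not taken from the site. In practice you would pass in two results of pullPage() instead.

```python
import difflib


def differing_chunks(html_a, html_b):
    """Return (tag, lines_only_in_a, lines_only_in_b) for each region that differs."""
    lines_a = html_a.splitlines()
    lines_b = html_b.splitlines()
    # autojunk=False keeps difflib from ignoring lines that repeat often,
    # which is common in large HTML documents
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    chunks = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            chunks.append((tag, lines_a[a1:a2], lines_b[b1:b2]))
    return chunks


# Hypothetical stand-ins for a "short" and a "long" response
short_page = "<html>\n<body>\n<p>product</p>\n</body>\n</html>"
long_page = "<html>\n<body>\n<p>product</p>\n<script>var recs = [];</script>\n</body>\n</html>"

chunks = differing_chunks(short_page, long_page)
for tag, only_a, only_b in chunks:
    # Show the kind of change and how many characters each side contributes
    print("%s: %d chars in first, %d chars in second" % (
        tag, sum(len(s) for s in only_a), sum(len(s) for s in only_b)))
```

If the extra characters all sit in one or two chunks (an embedded script or an additional block of markup, for example), that points to the server sending different variants of the page rather than to a download that was cut short.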
Upvotes: 2