urllib2 pulls a different page arbitrarily

Question

I was trying to write a bs4 scraper for this URL when I realized that it worked sometimes and not others, seemingly arbitrarily.

So I wrote some test code (you don't have to read all of it):

import urllib2
import sys
from bs4 import BeautifulSoup

# Returns the 302 response itself instead of following the redirect.
class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def pullPage():
    url = "http://shop.nordstrom.com/s/tory-burch-caroline-ballerina-flat-women/3152313?origin=category-personalizedsort&contextualcategoryid=0&fashionColor=Camellia+Pink+Beige&resultback=441"
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'} 
    req = urllib2.Request(url,headers=hdr)
    try:
        opener = urllib2.build_opener(RedirectHandler())
        webpage = opener.open(req)
        soup = BeautifulSoup(webpage, "html5lib")
        return str(soup)
    except Exception,e:
        print str(e)
        if '403' in str(e):
            sys.exit("This scraper is forbidden from this site")
        elif '[Errno -2]' in str(e):
            sys.exit("This program can not connect to the internet")
        sys.exit('Broken URL')

# Fetch the page 10 times and print the length of the parsed HTML each time.
for attempt in range(10):
    print len(pullPage())

This program prints out the number of characters in the HTML of the website 10 times. Here is the output:

Does anyone know why the HTML for this page sometimes comes back at almost double the size and sometimes not? Is there some way to wait until the whole page loads?

The code to focus on, I believe, is these lines:

webpage = opener.open(req)
soup = BeautifulSoup(webpage, "html5lib")

Edit 1: Could someone else run this code and let me know if their results are similar?

Edit 2: I have rerun this code on a separate machine (on a Google server) and got similar results:

6502 · Accepted Answer

There could be many reasons:

Maybe they're using A/B testing to check a variation
Maybe they have a layered structure and not all the back-end servers are aligned
Maybe they want to stop others from stealing and reselling the catalog
Maybe you are behind a proxy that is having fun
Maybe some antivirus software is trying to help you
Maybe your machine is infected by a virus that injects HTML content
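
One way to narrow these down is to record the response headers alongside each body and see whether the short and long pages come from different servers, caches, or cookies. The following is only a rough sketch, not a definitive diagnosis tool: the helper names (fetch, group_by_length, header_values, main) are made up for illustration, and the header names it inspects (Server, Via, Set-Cookie) are guesses at what might differ, since the site may not send any of them. The import fallback lets it run on Python 2 (urllib2, as in the question) or Python 3.

```python
from collections import defaultdict

try:  # Python 2, as used in the question
    from urllib2 import Request, urlopen
except ImportError:  # Python 3 equivalent
    from urllib.request import Request, urlopen


def fetch(url, headers=None):
    """Return (body_bytes, response_headers) for a single GET request.

    Note: converting the headers to a dict keeps only one value when a
    header such as Set-Cookie is repeated.
    """
    response = urlopen(Request(url, headers=headers or {}))
    try:
        return response.read(), dict(response.info().items())
    finally:
        response.close()


def group_by_length(responses):
    """Group the headers of (body, headers) pairs by the body length."""
    groups = defaultdict(list)
    for body, headers in responses:
        groups[len(body)].append(headers)
    return dict(groups)


def header_values(groups, name):
    """For each body length, collect the distinct values of one header.

    The header name is matched case-insensitively.
    """
    wanted = name.lower()
    result = {}
    for length, header_list in groups.items():
        values = set()
        for headers in header_list:
            for key, value in headers.items():
                if key.lower() == wanted:
                    values.add(value)
        result[length] = values
    return result


def main(url, attempts=10):
    """Fetch the URL several times and print what differs between sizes."""
    groups = group_by_length([fetch(url) for _ in range(attempts)])
    for length in sorted(groups):
        print("%d bytes: %d responses" % (length, len(groups[length])))
    # Hypothetical choice of headers; adjust to whatever the site sends.
    for name in ("Server", "Via", "Set-Cookie"):
        print("%s: %r" % (name, header_values(groups, name)))
```

Calling main() with the product URL prints how many responses fell into each size and which header values were seen for each. If the two sizes line up with different Server or Via values, mismatched back ends or a proxy become more likely; if they line up with different cookies, A/B testing becomes more likely. If the headers are identical, the raw bodies themselves would need to be compared.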
