Rorschach

Reputation: 3812

urllib2 pulls a different page arbitrarily

I was trying to write a bs4 scraper for this URL when I realized that it worked on some runs and failed on others, seemingly arbitrarily.

So I wrote the following test code (you don't have to read all of it):

import urllib2
import sys
from bs4 import BeautifulSoup

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def pullPage():
    url = "http://shop.nordstrom.com/s/tory-burch-caroline-ballerina-flat-women/3152313?origin=category-personalizedsort&contextualcategoryid=0&fashionColor=Camellia+Pink+Beige&resultback=441"
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'} 
    req = urllib2.Request(url,headers=hdr)
    try:
        opener = urllib2.build_opener(RedirectHandler())
        webpage = opener.open(req)
        soup = BeautifulSoup(webpage, "html5lib")
        return str(soup)
    except Exception,e:
        print str(e)
        if '403' in str(e):
            sys.exit("This scraper is forbidden from this site")
        elif '[Errno -2]' in str(e):
            sys.exit("This program can not connect to the internet")
        sys.exit('Broken URL')

happy = 1
while(happy < 10):
    print len(pullPage())
    happy = happy + 1

This program fetches the page nine times (the loop starts at 1 and runs while happy < 10) and prints the number of characters in the HTML each time. Here is the output:

218531
218524
377646
218551
377646
218559
218547
376938
218552

Does anyone know why the HTML returned for this page is almost twice as long on some requests as on others? Is there some way to wait until the whole page loads?

The code to focus on, I believe, is these lines:

webpage = opener.open(req)
soup = BeautifulSoup(webpage, "html5lib")

Edit 1: Could someone else run this code and let me know if their results are similar?

Edit 2: I have rerun this code on a separate machine (a Google server) and got similar results:

218565
218564
376937
376487
378243
218564
218557
378248
377791
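
To make the pattern in the two runs easier to see, here is a minimal sketch that groups the lengths reported above into clusters of similar size. The function name `cluster_lengths` and the 5000-character tolerance are assumptions chosen for illustration, not part of the original script. The snippet uses only built-ins and should run unchanged under both Python 2.7 and Python 3.

```python
def cluster_lengths(lengths, tolerance=5000):
    """Group lengths that lie within `tolerance` of the smallest value in a cluster."""
    clusters = []
    for n in sorted(lengths):
        if clusters and n - clusters[-1][0] <= tolerance:
            clusters[-1].append(n)   # close enough: same page variant
        else:
            clusters.append([n])     # too far away: start a new cluster
    return clusters

# Lengths copied from the two runs reported in this question.
first_run = [218531, 218524, 377646, 218551, 377646, 218559, 218547, 376938, 218552]
second_run = [218565, 218564, 376937, 376487, 378243, 218564, 218557, 378248, 377791]

for cluster in cluster_lengths(first_run + second_run):
    print("%d responses between %d and %d characters"
          % (len(cluster), min(cluster), max(cluster)))
```

With the numbers above, the lengths fall into exactly two groups, which suggests the server is returning one of two page variants rather than a partially loaded page.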

Upvotes: 0

Views: 29

Answers (1)

6502

Reputation: 114559

There could be many reasons:

  • Maybe they're using A/B testing to check a variation
  • Maybe they have a layered structure and not all the back-end servers are in sync
  • Maybe they want to stop others from stealing and reselling the catalog
  • Maybe you are behind a proxy that is having fun
  • Maybe some antivirus software is trying to help you
  • Maybe your machine is infected by a virus that injects HTML content
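
One rough way to narrow down which of these applies is to save one short and one long response and look at what the longer one contains that the shorter one does not. The sketch below does that with the standard-library `difflib` module. The function name `lines_only_in_longer` and the two sample strings are hypothetical stand-ins for real saved responses, and the snippet should run under both Python 2.7 and Python 3.

```python
import difflib

def lines_only_in_longer(short_html, long_html):
    """Return the lines of long_html that have no counterpart in short_html."""
    a = short_html.splitlines()
    b = long_html.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    extra = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' mark content that differs in the longer page
        if tag in ("insert", "replace"):
            extra.extend(b[j1:j2])
    return extra

# Hypothetical stand-ins for two saved responses from the same URL.
small = "<html>\n<body>\n<div id='product'>shoe</div>\n</body>\n</html>"
large = ("<html>\n<body>\n<div id='product'>shoe</div>\n"
         "<div id='recommendations'>more shoes</div>\n</body>\n</html>")

for line in lines_only_in_longer(small, large):
    print(line)
```

If the extra content looks like part of the site itself, a server-side cause such as A/B testing or out-of-sync back ends seems more likely. If it looks foreign to the site, a proxy or local software injecting content becomes the better guess. Real pages may put most of their markup on a few very long lines, so splitting on tags instead of newlines may give a more useful diff.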

Upvotes: 2
