BeautifulSoup sometimes gives exceptions

The strange thing is that the BeautifulSoup object sometimes gives the desired data, but other times I get an error such as "list index out of range" or "'NoneType' object has no attribute findNext()". The data I want is nested inside other elements.

This is the code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

a = soup.find(text=('Socket')).find_next('dd').string

print(a)

Upvotes: 5

Answers (3)

alecxe

Reputation: 473873

The actual problem is that the cell value is not always exactly Socket; sometimes it is surrounded by tabs or spaces. Instead of checking for an exact text match, pass a compiled regular expression pattern:

import re

soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)

Always prints 1150.
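To make the difference concrete, here is a minimal sketch on a hand-written fragment. The markup is an assumption for illustration only and is not copied from the store's page. It uses `string=`, which is the newer name for the `text=` argument in Beautiful Soup 4.4 and later, and names the parser explicitly.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the label is padded with a tab and newlines.
# The real page's HTML may look different.
html = """
<dl>
  <dt>
	Socket
  </dt>
  <dd> 1150 </dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, whitespace included,
# so the padded label is not found.
exact = soup.find(string="Socket")

# A compiled pattern is applied with search(), so the padding is ignored.
fuzzy = soup.find(string=re.compile("Socket"))

print(exact)                                       # None
print(fuzzy.find_next("dd").get_text(strip=True))  # 1150
```

With the exact match, `find()` returns None on the padded variant, which is where the "'NoneType' object has no attribute" error comes from.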

To explain the "sometimes" above (thanks to @carpetsmoker for the initial proposal in the comments):

If you open the page, then clear the cookies and refresh, you may see two different versions of the same page.

The blocks on the page are arranged differently, so the same URL serves two different layouts and two different HTML sources. What you are seeing is an A/B testing technique:

In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants leading to the technical term, Two-sample hypothesis testing, used in the field of statistics.

In other words, they are experimenting with the product page and gathering stats, like click-rate, number of sales made etc.

FYI, here's the working code I've got at the moment:

import re

from bs4 import BeautifulSoup
import requests

# A session keeps cookies between requests; every request sends a browser-like User-Agent
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
# Visit the home page first so that any cookies it sets are stored in the session
session.get('http://www.computerstore.nl', headers=headers)

response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)
soup = BeautifulSoup(response.content)
# Match any text node containing 'Socket', then read the <dd> that follows it
print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))

Upvotes: 3

drsnark

Reputation: 3033

I made a suggested change to your code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

# Look the label up once and only continue if it was found
label = soup.find(text=('Socket'))
if label:
    a = label.find_next('dd').string
    print(a)
else:
    # Display some error info, and/or do some error logging
    print("error")

Upvotes: -1

Aaron Digulla

Reputation: 328614

This means that the data returned by the store doesn't contain the elements you seek for some reason.

Add some proper error handling to the code which catches the exceptions and dumps the input when it breaks. That way, you can see what was downloaded and improve the code.

A first step would be:

try:
    a = soup.find(text=('Socket')).find_next('dd').string

    print(a)
except:
    print(plain_text)
    raise

If it's a lot of text, then write it to a file.
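As a rough sketch of the file variant: the helper name `socket_or_dump` and the default path `failed_page.html` are made up for illustration, and the function takes the HTML as a string so it can be tried without hitting the store.

```python
from bs4 import BeautifulSoup


def socket_or_dump(html, dump_path="failed_page.html"):
    """Return the Socket value, or save the HTML and re-raise on failure.

    Hypothetical helper; the name and default path are illustrative.
    """
    soup = BeautifulSoup(html, "html.parser")
    try:
        return soup.find(string="Socket").find_next("dd").string
    except AttributeError:
        # find() or find_next() returned None: keep the input for inspection.
        with open(dump_path, "w", encoding="utf-8") as fh:
            fh.write(html)
        raise
```

Calling `socket_or_dump(plain_text)` after the download would leave a copy of the failing page on disk, so you can open it and see what the store actually sent.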

It's also dangerous to string so many operations in a single line. If something goes wrong, then you won't know what. Split this into several lines, so you can quickly see whether it could find Socket or the dd element, etc.
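One way the split could look, as a sketch rather than a definitive version: the function name `find_socket` and the error messages are assumptions, and the label is matched with a regular expression so that surrounding whitespace does not matter.

```python
import re

from bs4 import BeautifulSoup


def find_socket(html):
    """Look up the Socket value one step at a time (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: locate the text node that holds the label.
    label = soup.find(string=re.compile("Socket"))
    if label is None:
        raise LookupError("no text node containing 'Socket'")

    # Step 2: locate the <dd> element that follows the label.
    value_cell = label.find_next("dd")
    if value_cell is None:
        raise LookupError("found 'Socket' but no following <dd>")

    # Step 3: read the value without surrounding whitespace.
    return value_cell.get_text(strip=True)
```

Each step fails with its own message, so the traceback says which lookup came back empty.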

Upvotes: -1
