Reputation: 825
The strange thing is that sometimes the BeautifulSoup object does give the desired data, but other times I get an error such as "list index out of range" or "NoneType object does not have attribute findNext()". The data I am looking for is nested inside other elements.
This is the code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
a = soup.find(text=('Socket')).find_next('dd').string
print(a)
Upvotes: 5
Views: 753
Reputation: 473873
The actual problem is that the cell value is not always exactly Socket; sometimes it is surrounded by tabs or spaces. Instead of checking for an exact text match, pass a compiled regular expression pattern:
import re
soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)
This always prints 1150.
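The difference between the two lookups can be reproduced offline. The following is a minimal sketch, assuming a made-up HTML fragment (not the store's real markup) in which the label is padded with whitespace:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: the label text is surrounded by newlines and spaces.
html = """
<dl>
  <dt>
      Socket
  </dt>
  <dd> 1150 </dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, padding included, so this finds nothing.
exact = soup.find(text="Socket")

# A compiled pattern is searched for inside each text node, so the padding does not matter.
label = soup.find(text=re.compile("Socket"))
socket_value = label.find_next("dd").get_text(strip=True)

print(exact)
print(socket_value)
```

Here the exact lookup returns None, which is what leads to the NoneType error in the question once find_next is called on it.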
To explain the "sometimes" I used above (thanks to @carpetsmoker for the initial proposal in the comments): if you open the page, clear the cookies and refresh, you may see two different versions of the same page.
The blocks on the page are arranged differently in each version, so the same URL yields two different looks and two different HTML sources. What you are seeing is an A/B testing technique:
In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants leading to the technical term, Two-sample hypothesis testing, used in the field of statistics.
In other words, they are experimenting with the product page and gathering stats, like click-rate, number of sales made etc.
FYI, here's the working code I've got at the moment:
import re
from bs4 import BeautifulSoup
import requests
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
session.get('http://www.computerstore.nl', headers=headers)
response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)
soup = BeautifulSoup(response.content)
print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))
Upvotes: 3
Reputation: 3033
I made a suggested change to your code:
import requests
from bs4 import BeautifulSoup

url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
# Look the label up once; find() returns None when the text is absent
label = soup.find(text=('Socket'))
if label:
    a = label.find_next('dd').string
    print(a)
else:
    # Display some error info, and/or do some error logging
    print("error")
Upvotes: -1
Reputation: 328614
This means that the data returned by the store doesn't contain the elements you seek for some reason.
Add some proper error handling to the code which catches the exceptions and dumps the input when it breaks. That way, you can see what was downloaded and improve the code.
A first step would be:
try:
    a = soup.find(text=('Socket')).find_next('dd').string
    print(a)
except:
    print(plain_text)
    raise
If it's a lot of text, then write it to a file.
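A minimal sketch of the write-to-file variant, assuming a hypothetical helper named extract_socket and a dump file in the system temp directory (both names are illustrative, not part of the original code):

```python
import os
import tempfile
from bs4 import BeautifulSoup

def extract_socket(plain_text, dump_path):
    """Return the Socket value; on failure, save the HTML to dump_path and re-raise."""
    soup = BeautifulSoup(plain_text, "html.parser")
    try:
        return soup.find(text="Socket").find_next("dd").string
    except AttributeError:
        # find() returned None, so keep the downloaded page for later inspection.
        with open(dump_path, "w", encoding="utf-8") as handle:
            handle.write(plain_text)
        raise

# Hypothetical input that lacks the Socket label, to trigger the failure path.
dump_path = os.path.join(tempfile.gettempdir(), "failed_page.html")
broken_html = "<dl><dt>Chipset</dt><dd>Z97</dd></dl>"
try:
    extract_socket(broken_html, dump_path)
    failed = False
except AttributeError:
    failed = True

with open(dump_path, encoding="utf-8") as handle:
    dumped = handle.read()
```

After a failure, the saved file can be opened in a browser or editor to see which variant of the page was served.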
It's also dangerous to string so many operations in a single line. If something goes wrong, then you won't know what. Split this into several lines, so you can quickly see whether it could find Socket or the dd element, etc.
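One way to split the chain, sketched here with hypothetical function names and made-up HTML fragments, is to check each lookup separately so the error says which step failed:

```python
from bs4 import BeautifulSoup

def find_socket(html):
    """Look up the Socket value step by step, failing with a specific message."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: locate the label text.
    label = soup.find(text="Socket")
    if label is None:
        raise LookupError("no text node equal to 'Socket'")

    # Step 2: locate the dd element that follows the label.
    value_cell = label.find_next("dd")
    if value_cell is None:
        raise LookupError("found 'Socket' but no following dd element")

    return value_cell.get_text(strip=True)

def describe_failure(html):
    """Return the error message for html, or None if the lookup succeeds."""
    try:
        find_socket(html)
    except LookupError as error:
        return str(error)
    return None

print(find_socket("<dl><dt>Socket</dt><dd>1150</dd></dl>"))
print(describe_failure("<dl><dt>Chipset</dt><dd>Z97</dd></dl>"))
print(describe_failure("<p>Socket</p>"))
```

The second and third calls fail at different steps, and the message tells you which one.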
Upvotes: -1