Reputation: 197
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.amazon.com/dp/B00IOXUJRY'
page = BeautifulSoup(urllib2.urlopen(url))
print page
title = page.find(id='productTitle') #.text.replace('\t','').strip()
print repr(title)
If I try to get the text of this productTitle id, it returns None, although I printed the page value and checked whether this is static text or coming from JavaScript/AJAX. I've already spent an hour on this but can't find the reason. Maybe I'm making a small silly mistake I'm not aware of?
PS: I have one more query. There is a "Product Description" section below the "Important Information" section. This is JavaScript-generated content (I think so?), so I would have to use a Selenium/PhantomJS kind of library. Is there any way to get this content with BeautifulSoup or a Python built-in library (because Selenium is too slow), or with another library like mechanize or RoboBrowser, etc.?
Upvotes: 0
Views: 153
Reputation: 473893
You are experiencing the differences between the parsers BeautifulSoup uses under the hood. Since you haven't specified one explicitly, BeautifulSoup chooses one automatically:
The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
Here's the demo of what is happening:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.amazon.com/dp/B00IOXUJRY'
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
>>> print page.find(id='productTitle')
None
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html5lib')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
>>> page = BeautifulSoup(urllib2.urlopen(url), 'lxml')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
In other words, the solution is to explicitly specify the parser, either html5lib or lxml; just make sure you have these modules installed.
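As a quick sanity check, here is a small sketch that reports which of the optional parser modules are importable in your environment (so you know which values you can pass to BeautifulSoup):

```python
def available_parsers():
    """Return a dict mapping optional parser module name to availability."""
    result = {}
    for name in ("lxml", "html5lib"):
        try:
            __import__(name)  # succeeds only if the module is installed
            result[name] = True
        except ImportError:
            result[name] = False
    return result
```

If either comes back False, a `pip install lxml` or `pip install html5lib` fixes it.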
To get the product description, you don't need the selenium+PhantomJS approach. You can get it with BeautifulSoup:
print page.find('div', class_='productDescriptionWrapper').text.strip()
Prints:
Coffee People Donut Shop K-Cup Coffee is a medium roast coffee reminiscent of the cup of joe that you find at classic donut counters throughout the United States. Sweet and rich with dessert flavors in every single cup, this classic coffee is approachable even to those who fear coffee bitters. Sweet savory flavor set Coffee People Donut Shop coffees apart from your average coffee blends, and now you can enjoy this unique coffee with the convenience of single serve K-Cup refills. Includes 24 K-Cups.
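Putting both pieces together, here is a minimal sketch of a helper that parses the title and description out of already-fetched page HTML. It assumes the element ids/classes seen above (productTitle, productDescriptionWrapper) stay the same, and it returns None for anything missing instead of raising AttributeError:

```python
from bs4 import BeautifulSoup

def extract_product_info(html, parser='html5lib'):
    """Parse the product title and description out of raw page HTML.

    Returns a (title, description) tuple; either item may be None if the
    element is absent. 'html5lib' (or 'lxml') is deliberately the default,
    since Python's built-in parser fails on this page as shown above.
    """
    page = BeautifulSoup(html, parser)
    title = page.find(id='productTitle')
    desc = page.find('div', class_='productDescriptionWrapper')
    return (title.get_text(strip=True) if title else None,
            desc.get_text(strip=True) if desc else None)
```

Feed it the result of urllib2.urlopen(url).read(); separating fetching from parsing also makes the extraction easy to test offline.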
Upvotes: 1