dotancohen

Reputation: 31471

Beautiful Soup not finding string

When parsing http://en.wikipedia.org/wiki/Israel I encounter an H2 tag which has text, yet Beautiful Soup returns a None type for it:

$ python
Python 2.7.3 (default, Apr 10 2013, 05:13:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import requests
>>> from pprint import pprint
>>> response = requests.get('http://en.wikipedia.org/wiki/Israel')
>>> soup = bs4.BeautifulSoup(response.content)
>>> for h in soup.find_all('h2'):
...     pprint(str(type(h)))
...     pprint(h)
...     pprint(str(type(h.string)))
...     pprint(h.string)
...     print('--')
...                     
"<class 'bs4.element.Tag'>"
<h2>Contents</h2>    
"<class 'bs4.element.NavigableString'>"
u'Contents'          
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>
"<type 'NoneType'>"  
None                 
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="History">History</span></h2>
"<class 'bs4.element.NavigableString'>"
u'History'           
--

Note that this is not a parsing issue; Beautiful Soup parses the document just fine. Why does the second H2 element return None for .string? Is it due to the leading " " (space) in the string? How can I work around this? This is with Beautiful Soup 4 on Python 2.7, Kubuntu Linux 12.10.

Upvotes: 2

Views: 3268

Answers (2)

thkang

Reputation: 11533

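I think this is because the second h2 has no text of its own. Instead, it has a span as its only child, and that span has two children: another, empty span (the h2's grandchild) and the string " Etymology". Since the outer span contains more than one thing, its .string is None, and so is the h2's.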

For this kind of parsing, use generator-based attributes like .stripped_strings and .strings.

>>> s.find_all('h2')
[<h2>Contents</h2>, <h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>]
>>> list(s.find_all('h2')[-1].stripped_strings)
[u'Etymology']
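
To make the difference between the two generators concrete, here is a minimal sketch that runs on just the second heading copied from the question, rather than on the live Wikipedia page. The choice of the built-in 'html.parser' and the variable names are my own assumptions, not something from the original session.

```python
import bs4

# The second heading from the question, copied verbatim.
html = ('<h2><span class="mw-headline" id="Etymology">'
        '<span id="Etymology"></span> Etymology</span></h2>')
h2 = bs4.BeautifulSoup(html, 'html.parser').h2

raw = list(h2.strings)                # every string under the tag, whitespace kept
stripped = list(h2.stripped_strings)  # same strings, with surrounding whitespace trimmed

print(raw)
print(stripped)
```

.strings keeps the leading space from the markup, while .stripped_strings trims it, so the latter is usually the more convenient one for headings.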

Upvotes: 1

nickie

Reputation: 5808

First, the first half of the question: what's wrong.

Quoting from the documentation of bs4: "If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

And now the other half, how to fix it.

Quoting again from the same source: "If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator." Better still, use the .stripped_strings generator and concatenate the results; I think you'll get what you want.
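
As a rough sketch of that suggestion, the snippet below joins the stripped strings for the three headings shown in the question. The helper name heading_text, the inline HTML, and the use of 'html.parser' are assumptions made for illustration; the question itself fetched the page with requests. The last line also checks the leading-space theory: a span holding only " Etymology" still has a .string, so the space by itself is not what produces None.

```python
import bs4

def heading_text(tag):
    """Hypothetical helper: join every string inside tag, trimmed of whitespace."""
    return ' '.join(tag.stripped_strings)

# The three headings from the question, copied verbatim.
html = (
    '<h2>Contents</h2>'
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
    '<h2><span class="mw-headline" id="History">History</span></h2>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')
headings = soup.find_all('h2')

direct = [h.string for h in headings]        # None where a tag holds more than one thing
joined = [heading_text(h) for h in headings]  # works for all three headings

# Same text with a leading space, but without the empty inner span.
spaced = bs4.BeautifulSoup('<h2><span> Etymology</span></h2>', 'html.parser').h2.string

print(direct)
print(joined)
print(repr(spaced))
```

Joining with a single space is a design choice here; for headings made of several inline elements you may prefer a different separator.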

Upvotes: 2
