Reputation: 31471
When parsing http://en.wikipedia.org/wiki/Israel I encounter an H2 tag which has text, yet Beautiful Soup returns a None type for it:
$ python
Python 2.7.3 (default, Apr 10 2013, 05:13:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import requests
>>> from pprint import pprint
>>> response = requests.get('http://en.wikipedia.org/wiki/Israel')
>>> soup = bs4.BeautifulSoup(response.content)
>>> for h in soup.find_all('h2'):
...     pprint(str(type(h)))
...     pprint(h)
...     pprint(str(type(h.string)))
...     pprint(h.string)
...     print('--')
...
"<class 'bs4.element.Tag'>"
<h2>Contents</h2>
"<class 'bs4.element.NavigableString'>"
u'Contents'
--
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>
"<type 'NoneType'>"
None
--
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="History">History</span></h2>
"<class 'bs4.element.NavigableString'>"
u'History'
--
Note that this is not a parsing issue: Beautiful Soup parses the document just fine. Why does the second H2 element return a None type? Is it due to the leading " " (space) in the string? How can I work around this? This is with Beautiful Soup 4 on Python 2.7, Kubuntu Linux 12.10.
Upvotes: 2
Views: 3268
Reputation: 11533
I think this is because the second h2 has no text of its own. Instead it has a span as a child, and that span in turn contains an empty span (the h2's grandchild) followed by the text. For this kind of parsing, use generator-based attributes like .stripped_strings and .strings.
>>> s.find_all('h2')
[<h2>Contents</h2>, <h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>]
>>> list(s.find_all('h2')[-1].stripped_strings)
[u'Etymology']
Upvotes: 1
Reputation: 5808
First, what is going wrong.
Quoting from the documentation of bs4: "If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."
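To make the quoted rule concrete, here is a minimal sketch using a hand-written HTML fragment (an assumption for illustration, modelled on the headings in the question rather than fetched from Wikipedia) and the standard-library html.parser backend. It shows that .string is still available when each tag has exactly one child, and becomes None once a tag along the way has two.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the headings in the question (illustrative only).
soup = BeautifulSoup(
    "<h2>Contents</h2>"
    "<h2><span>History</span></h2>"
    "<h2><span><span></span> Etymology</span></h2>",
    "html.parser",
)
direct, single_child, mixed = soup.find_all("h2")

# The h2's only child is a string, so .string returns it.
print(repr(direct.string))

# The h2's only child is a tag whose only child is a string,
# so .string follows that single path down to the text.
print(repr(single_child.string))

# Here the inner span has two children (an empty span and a string),
# so there is no single obvious string and .string is None.
print(repr(mixed.string))
print(len(mixed.span.contents))
```

The leading space in " Etymology" plays no part: it is the extra empty span sitting next to the text that leaves .string with nothing unambiguous to return.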
And now the other half, how to fix it.
Quoting again from the same source: "If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator." Better still, use the .stripped_strings generator, concatenate the results and I think you'll get what you want.
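One possible way to do that concatenation is sketched below. The helper name heading_text and the inline HTML fragment are assumptions made for illustration, and joining with a single space is just one reasonable choice of separator.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the Wikipedia headings from the question (illustrative only).
html = (
    '<h2>Contents</h2>'
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
)
soup = BeautifulSoup(html, "html.parser")


def heading_text(tag):
    """Hypothetical helper: join every text fragment under `tag`.

    .stripped_strings yields each string with surrounding whitespace
    removed and skips whitespace-only strings.
    """
    return " ".join(tag.stripped_strings)


plain, nested = soup.find_all("h2")
titles = [heading_text(h) for h in (plain, nested)]
print(titles)
```

Because the helper never relies on .string, it behaves the same for headings with a bare text child and for headings whose text is wrapped in nested spans.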
Upvotes: 2