Reputation: 5106
I'm trying to get articles from wired.com. Generally, their articles' content looks like this:
<article itemprop="articleBody">
<p>Some text</p>
<p>Next text</p>
<p>...</p>
<p>...</p>
</article>
or like this:
<article itemprop="articleBody">
<div class="listicle-captions marg-t...">
<p></p>
</div>
</article>
So if the page is of type 1, I want the <p> and <h> tags to be extracted, while if the page is of type 2, I want to do something else. In other words, if the <p> and <h> tags are direct children of <article>, then it's type 1.
I tried the following code, which looks for <p> and <h> tags and prints out the tag names. The problem is that recursive="False" doesn't seem to help: when tested on a type 2 page, it still finds the tags, while it shouldn't (I expected to get a NoneType object).
import urllib.request
from bs4 import BeautifulSoup
import datetime
import html
import sys
articleUrl="https://www.wired.com/2016/07/greatest-feats-inventions-100-years-boeing/"
soupArticle=BeautifulSoup(urllib.request.urlopen(articleUrl), "html.parser")
articleBody=soupArticle.find("article", {"itemprop":"articleBody"})
articleContentTags=articleBody.findAll(["h1", "h2","h3", "p"], recursive="False")
for tag in articleContentTags:
    print(tag.name)
    print(tag.parent.encode("utf-8"))
Why doesn't it work?
PS: Also, is there a difference between using findAll and findChildren, in general and in this particular case? These two look the same to me.
Upvotes: 4
Views: 14182
Reputation: 180401
The string literal "False" is not the same as the boolean False; you need to actually pass recursive=False:
articleBody.find_all(["h1", "h2","h3", "p"], recursive=False)
Any non-empty string is considered a truthy value. The only string you could pass that would work is an empty string, i.e. recursive="".
In [17]: bool("False")
Out[17]: True
In [18]: bool("foo")
Out[18]: True
In [19]: bool("")
Out[19]: False
But stick to using the actual boolean False. Also, with recursive=False you will get an empty list/ResultSet returned, not None, because you are calling find_all, not find.
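To see the difference without hitting the live site, here is a minimal sketch. The two markup strings are made-up stand-ins modeled on the type 1 and type 2 layouts from the question (not real wired.com pages), and the helper name content_tag_names is just for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two page layouts described in the question.
TYPE_1 = ('<article itemprop="articleBody">'
          '<h2>Title</h2><p>Some text</p><p>Next text</p>'
          '</article>')
TYPE_2 = ('<article itemprop="articleBody">'
          '<div class="listicle-captions"><p>Caption</p></div>'
          '</article>')

TAGS = ["h1", "h2", "h3", "p"]


def content_tag_names(markup, recursive):
    """Return the names of heading/paragraph tags found under the article.

    `recursive` is passed straight through to find_all so that the
    boolean False and the string "False" can be compared side by side.
    """
    soup = BeautifulSoup(markup, "html.parser")
    article = soup.find("article", {"itemprop": "articleBody"})
    return [tag.name for tag in article.find_all(TAGS, recursive=recursive)]


# Boolean False: only direct children of <article> are considered.
print(content_tag_names(TYPE_1, False))    # type 1: headings and paragraphs
print(content_tag_names(TYPE_2, False))    # type 2: empty list, not None

# The string "False" is truthy, so the search still descends into the <div>.
print(content_tag_names(TYPE_2, "False"))  # type 2: the nested <p> is found
```

So a type 1 page can be detected by checking whether the result of find_all(..., recursive=False) is non-empty, rather than by comparing it to None.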
Upvotes: 9