parsecer

Reputation: 5106

Beautifulsoup: findAll recursive doesn't work

I'm trying to get articles from wired.com. Generally, their articles' content looks like this:

<article itemprop="articleBody">
   <p>Some text</p>
   <p>Next text</p>
   <p>...</p>
   <p>...</p>
</article>

or like this:

<article itemprop="articleBody">
    <div class="listicle-captions marg-t...">
        <p></p>

    </div>

 </article>

If the page is of type 1, I want to extract the <p> and <h> tags; if it is of type 2, I want to do something else. In other words, if the <p> and <h> tags are direct children of <article>, the page is type 1. I tried the following code, which looks for <p> and <h> tags and prints their names. The problem is that recursive="False" doesn't seem to help: when tested on a type 2 page, it still finds the tags, while it shouldn't (I expected to get a NoneType object).

import urllib.request
from bs4 import BeautifulSoup
import datetime
import html
import sys

articleUrl="https://www.wired.com/2016/07/greatest-feats-inventions-100-years-boeing/"

soupArticle=BeautifulSoup(urllib.request.urlopen(articleUrl), "html.parser")

articleBody=soupArticle.find("article", {"itemprop":"articleBody"})
articleContentTags=articleBody.findAll(["h1", "h2","h3", "p"], recursive="False")

for tag in articleContentTags:
    print(tag.name)
    print(tag.parent.encode("utf-8"))

Why doesn't it work?

PS: Also, is there a difference between findAll and findChildren, both in general and in this particular case? The two look the same to me.

Upvotes: 4

Views: 14182

Answers (1)

Padraic Cunningham

Reputation: 180401

The string literal "False" is not the same as the boolean False; you need to actually pass recursive=False:

articleBody.find_all(["h1", "h2","h3", "p"], recursive=False)

Any non-empty string is considered truthy, so the only string you could pass that would work is the empty string, i.e. recursive="":

In [17]: bool("False")
Out[17]: True

In [18]: bool("foo")
Out[18]: True

In [19]: bool("")
Out[19]: False

But stick to using the actual boolean False. Also, with recursive=False you will get an empty list/ResultSet back, not None, because you are calling find_all, not find.
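
To see the difference without hitting the network, here is a minimal sketch that runs the same search on two inline snippets modelled on the markup in the question. The names TYPE_1, TYPE_2 and direct_content_tags are made up for this illustration, and the class name in TYPE_2 is shortened because the original one is truncated.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two page layouts described in the question.
TYPE_1 = """
<article itemprop="articleBody">
  <p>Some text</p>
  <p>Next text</p>
</article>
"""

TYPE_2 = """
<article itemprop="articleBody">
  <div class="listicle-captions">
    <p>Caption</p>
  </div>
</article>
"""


def direct_content_tags(markup, recursive):
    """Return the names of the h1/h2/h3/p tags found under <article>."""
    soup = BeautifulSoup(markup, "html.parser")
    article = soup.find("article", {"itemprop": "articleBody"})
    found = article.find_all(["h1", "h2", "h3", "p"], recursive=recursive)
    return [tag.name for tag in found]


# Boolean False: only direct children of <article> are considered.
print(direct_content_tags(TYPE_1, False))    # ['p', 'p']
print(direct_content_tags(TYPE_2, False))    # []

# The string "False" is truthy, so the search still descends into the <div>.
print(direct_content_tags(TYPE_2, "False"))  # ['p']
```

An empty result on a type 2 page can then be tested with a plain `if not result:` check, since an empty list is falsy.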

Upvotes: 9
