alphazwest

Reputation: 4450

Get Certain Tags Within Parent Tag Using Beautifulsoup4

I am using beautifulsoup4 with Python to scrape content from the web, and I am trying to extract content from specific HTML tags while ignoring others.

I have the following html:

<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>

My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring all the <div>s inside it.

Currently, I am locating the content of the parent div by the following method:

content = soup.find('div', class_='the-one-i-want')

However, I can't figure out how to narrow that result down to only the <p> tags without getting an error.

Upvotes: 1

Views: 1607

Answers (1)

Padraic Cunningham

Reputation: 180391

h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

Once you have found the parent div, you can call find_all("p") on the result:

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(h, "html.parser")

print(soup.find("div","the-one-i-want").find_all("p"))

Or use a CSS selector:

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]
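If you only need the text rather than the Tag objects, something along these lines may help. This is a minimal sketch: the helper name paragraph_texts and the shortened sample HTML are made up for illustration. It also guards against find returning None when the parent div is missing, which is a common cause of AttributeError in this kind of chained call.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; not the exact page from the question.
sample = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">ignored</div>
    <p>
        "more text"
    </p>
</div>"""


def paragraph_texts(html, parent_class):
    """Return the stripped text of every <p> inside the first div with parent_class."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find("div", class_=parent_class)
    if parent is None:
        # find returns None when nothing matches; avoid calling find_all on it.
        return []
    return [p.get_text(strip=True) for p in parent.find_all("p")]


texts = paragraph_texts(sample, "the-one-i-want")
print(texts)
```

Calling the helper with a class that does not exist in the document returns an empty list instead of raising.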

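find_all only searches the descendants of the div with the class the-one-i-want, and the same applies to the select call. Note that "descendants" includes <p> tags nested at any depth inside that div, not just its direct children.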
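If the unwanted inner div could itself contain <p> tags, you may want only the direct children of the parent. A small sketch of that, assuming a made-up sample in which the inner div holds a nested paragraph: find_all accepts recursive=False, and the CSS child combinator > does the same job in select.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a <p> nested inside the unwanted div.
html = """
<div class="the-one-i-want">
    <p>first</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested</p>
    </div>
    <p>second</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default behaviour: every <p> descendant, including the nested one.
all_p = [p.get_text(strip=True) for p in parent.find_all("p")]

# recursive=False restricts the search to direct children of parent.
direct_p = [p.get_text(strip=True) for p in parent.find_all("p", recursive=False)]

# The same restriction expressed as a CSS child combinator.
direct_p_css = [p.get_text(strip=True) for p in soup.select("div.the-one-i-want > p")]

print(all_p)
print(direct_p)
print(direct_p_css)
```

With the HTML from the question the two approaches return the same five paragraphs, since none of them sit inside the inner div.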

Upvotes: 3
