Reputation: 4450
I am using beautifulsoup4 with Python to scrape content from the web, and I am trying to extract content from specific HTML tags while ignoring others.
I have the following html:
<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>
My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring all the <div>s within.
Currently, I am locating the content of the parent div by the following method:
content = soup.find('div', class_='the-one-i-want')
However, I can't seem to figure out how to further specify that only the <p> tags should be extracted from that, without error.
Upvotes: 1
Views: 1607
Reputation: 180391
h = """<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>"""
You can just use find_all("p") on the result of your find:
from bs4 import BeautifulSoup
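# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(h, "html.parser")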
print(soup.find("div","the-one-i-want").find_all("p"))
Or use a CSS selector:
print(soup.select("div.the-one-i-want p"))
Both will give you:
[<p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>]
find_all will only find descendants of the div with the class the-one-i-want, and the same applies to select.
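One caveat: "descendants" includes <p> tags nested inside the inner div as well. The markup in the question has none there, so it makes no difference, but if the unwanted div did contain paragraphs, something like the following sketch should restrict the match to direct children. The html string and the "keep"/"skip" texts are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup: the unwanted div now holds a <p>.
html = """<div class="the-one-i-want">
<p>keep 1</p>
<div class="random-inserted-element-i-dont-want">
<p>skip me</p>
</div>
<p>keep 2</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# recursive=False limits the search to direct children of the parent div.
direct = parent.find_all("p", recursive=False)

# The CSS child combinator (>) expresses the same restriction.
direct_css = soup.select("div.the-one-i-want > p")

# get_text(strip=True) returns just the text, without tags or padding whitespace.
texts = [p.get_text(strip=True) for p in direct]
print(texts)  # ['keep 1', 'keep 2']
```

A plain parent.find_all("p") on this variant would return all three paragraphs, including the nested one.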
Upvotes: 3