Reputation: 53
I am new to BeautifulSoup and have a basic question about scraping information from university course websites. The HTML is shown below. I'd like to get the text inside <p> tags, but only from <p> tags that have no child tags such as <strong> or <em>.
Desired text: This course introduces....
Really appreciate your help!
<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p >
<p>This course introduces.....</p >
<p>
<em>Prerequisites: None.</em>
</p >
<p><a aria-label="MSDS 402-DL Section, ID#: 4765" class="link-list" href=" ">View MSDS 402-DL Sections</a ></p >
Upvotes: 1
Views: 873
Reputation: 195458
You can use the CSS selector p:not(:has(*)), which selects <p> tags that have no child tags.
For example:
from bs4 import BeautifulSoup
txt = '''<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p >
<p>This course introduces.....</p >
<p>
<em>Prerequisites: None.</em>
</p >
<p><a aria-label="MSDS 402-DL Section, ID#: 4765" class="link-list" href=" ">View MSDS 402-DL Sections</a ></p >'''
soup = BeautifulSoup(txt, 'html.parser')
for p in soup.select('p:not(:has(*))'):
    print(p)
Prints:
<p>This course introduces.....</p>
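If you only need the text rather than the whole tag, you can call get_text() on each match. As a rough sketch (not the only way to do it), the snippet below also shows an alternative that avoids the :has() selector by passing a predicate function to find_all(). The helper name has_no_child_tags and the variable names are made up for illustration, and the sample HTML is a trimmed copy of the markup in the question with normalised closing tags. The selector version assumes a Beautiful Soup release whose select() is backed by Soup Sieve (4.7.0 or later).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the question's markup (closing tags normalised for the example)
html = '''<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p>
<p>This course introduces.....</p>
<p>
<em>Prerequisites: None.</em>
</p>
<p><a class="link-list" href=" ">View MSDS 402-DL Sections</a></p>'''

soup = BeautifulSoup(html, 'html.parser')

# Option 1: CSS selector, then extract just the text of each match
via_selector = [p.get_text(strip=True) for p in soup.select('p:not(:has(*))')]


# Option 2: hypothetical helper that keeps <p> tags whose children
# are all plain strings (i.e. none of the children is a Tag)
def has_no_child_tags(tag):
    return tag.name == 'p' and not any(isinstance(child, Tag) for child in tag.children)


texts = [p.get_text(strip=True) for p in soup.find_all(has_no_child_tags)]

print(via_selector)
print(texts)
```

Both lists should contain only the description paragraph. The predicate version is a little more verbose, but it makes the "no child tags" rule explicit and is easy to extend, for example to also skip empty paragraphs.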
Upvotes: 2