Reputation: 69
Is there a way to scrape p tags whose class does not contain any of several words? Here's my code so far (pieced together from other snippets and from searching StackOverflow):
import requests
import bs4
import re
url = 'https://www.sp2.upenn.edu/person/amy-hillier/'
req = requests.get(url).text
soup = bs4.BeautifulSoup(req,'html.parser')
regex = re.compile('^((?!Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital).)*$')
for texts in soup.find_all('div'):
    for i in texts.find_all('p', {'class': regex}):
        print(i)
My thinking is that the regex lists strings that, if present in a paragraph's class attribute, should make the scraper skip that paragraph. Put simply, if any of these words appears in the class, the paragraph should not be scraped.
Someone also recommended that I use CSS selector syntax with the :not() pseudo-class and the *= (contains) operator, which I interpreted as:
for texts in soup.find_all('div'):
    for i in texts.select('p[class]:not([class*="Header|header|button|Root|root|logo|Title|title|Foot|foot|Publish|Story|story|Stories|stories|Link|link|color|space|email|address|download|capital"])'):
        print(i)
Unfortunately, neither of them works. Any help is greatly appreciated!
Edit: Adding examples of the text:
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
I need to scrape the second and third paragraphs. My logic is that since the first paragraph's class contains the word 'color', I can exclude it. The other words in the regex variable are ones I have found across multiple URLs and need to exclude. I hope that clarifies my question.
Upvotes: 1
Views: 62
Reputation: 195408
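As far as I can tell, the regex attempt fails because Beautiful Soup tests a class filter against each class token separately, so a negative pattern matches as soon as any single token (such as "sub") lacks the excluded words. A paragraph with no class attribute has nothing to test the pattern against, so it is skipped as well. Here is a small sketch of that behaviour, using shortened stand-in HTML and a shortened pattern (both are my assumptions, not the real page):

```python
import re
from bs4 import BeautifulSoup

# Stand-in HTML: same class attributes as the question, shortened text.
html_doc = """
<p class="sub has-white-color has-normal-font-size tw-pb-5">first</p>
<p>second</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">third</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Shortened version of the negative-lookahead pattern from the question.
regex = re.compile(r"^((?!color|header|title).)*$")

# The pattern is tried on each class token on its own ("sub", "has-white-color", ...).
matched = [p.get_text() for p in soup.find_all("p", {"class": regex})]
print(matched)  # expected: ['first', 'third']
```

So the first paragraph is kept because of its "sub" token, and the second one is dropped because it has no class at all, which is the opposite of what you want.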
Perhaps you can use a custom function when searching for the right <p> tags. For example:
from bs4 import BeautifulSoup
html_doc = """\
<p class="sub has-white-color has-normal-font-size tw-pb-5">
The world needs leaders equipped with tools to make a difference. The School of Social Policy & Practice (SP2) will prepare you to become one of those leaders, as a policy maker, practitioner, educator, activist, and more.
</p>
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
words = ["color"]
for p in soup.find_all(
    lambda t: t.name == "p"
    and all(w not in c.lower() for c in t.get("class", []) for w in words)
):
    print(p)
    print("-" * 80)
Prints:
<p>
Amy Hillier (she/her/her) currently teaches introductory-level GIS (mapping) courses for SP2 and Urban Studies program and chairs the MSW racism course sequence. Her doctoral and post-doctoral research focused on historical mortgage redlining. For more than a decade, her research focused on links between the built environment and public health. During that time, her primary faculty position was with the Department of City & Regional Planning in the Weitzman School of Design. She moved to SP2 in 2017 in order to pursue new research interests relating to LGBTQ communities, particularly trans youth. She is the founding director of the LGBTQ Certificate.
</p>
--------------------------------------------------------------------------------
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">
Dr. Sahingur joined Penn Dental Medicine in September 2019 as Associate Dean of Graduate Studies and Student Research, providing leadership, strategic vision, and oversight to support and expand the graduate studies and student research endeavors at the School. She will be overseeing the Summer Student Research Program for the summer of 2020. Originally from Istanbul, Turkey, she received her DDS from Istanbul University, Turkey, in 1994 and then moved to the U.S. for her postgraduate education. She completed all of her postgraduate training at State University of New York at Buffalo, receiving a Master of Science degree in Oral Sciences in 1999 and then a PhD in Oral Biology with a clinical certificate in Periodontics in 2004.
</p>
--------------------------------------------------------------------------------
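The CSS selector route from the question can probably be made to work too. Inside an attribute selector, "|" is a literal character rather than an alternation, so each word needs its own :not(), and the leading p[class] has to go or paragraphs without a class are excluded. Below is a minimal sketch, assuming the select() backend (Soup Sieve) supports the case-insensitive "i" flag on attribute selectors; the HTML is shortened stand-in markup and the words list is the question's list lowercased:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: same class attributes as the question, shortened text.
html_doc = """
<p class="sub has-white-color has-normal-font-size tw-pb-5">first</p>
<p>second</p>
<p class="Paragraph-sc-1mxv4ns-0 bGbcwt">third</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Lowercased fragments to exclude; the " i" flag below ignores case,
# so "Header" and "header" need only one entry.
words = [
    "header", "button", "root", "logo", "title", "foot", "publish",
    "story", "stories", "link", "color", "space", "email", "address",
    "download", "capital",
]

# One :not([class*="..." i]) per word, chained onto a plain "p".
selector = "p" + "".join(f':not([class*="{w}" i])' for w in words)

kept = [p.get_text() for p in soup.select(selector)]
print(kept)  # expected: ['second', 'third']
```

Both versions should select the same paragraphs here; the lambda is easier to extend with extra conditions, while the selector keeps everything in a single string.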
Upvotes: 1