How do you scrape specific paragraph text using find or select in Python?

Question

I am trying to scrape a website that has paragraph content set off by h3 headers. The H3 headers have title content (e.g. make, model, year) and then the following text in a paragraph is what I'm after. Ideally, I'd love to identify the paragraph content using a class but it's all the same class.

Also, I should say, I'm new to this so please forgive stupid questions or inarticulate phrasing.

I've gotten to the point where I can pull all of the text together, but it is still separated by the HTML paragraph tags. My thoughts are: (1) if I could somehow turn the content between the paragraph tags into individual items in a list, I could write a loop over that list to add those items to the database I'm building.

Alternatively, (2) I was wondering if there is some way to pull the sibling paragraph text from paragraphs that follow headers containing specific text (e.g. Ford 'Models'). I know you can do it with defined ids and classes, but can you identify a specific h3 tag based on the text it contains?

I've been watching YouTube and reading forums non-stop for the past couple of days. Any feedback, delivered as bluntly as you wish, would be GREATLY appreciated! I'm using BeautifulSoup but am happy to use whatever is best for the job.

Thanks!

John

Here is the HTML:

<h4>Pedigree</h4>
<p>This is where the content is for pedigree that I am trying to scrape</p>
<h4>Breed</h4>
<p>This is where the content is for breed that I am trying to scrape</p>
<h4>Origin</h4>
<p>This is where the content is for origin that I am trying to scrape</p>

If I could identify the content by class alone, this is the code I would be using:

pedigree_temp = soup.find(class_='pedigree').text
pedigree_final.append(pedigree_temp)

QHarr · Accepted Answer

With bs4 4.7.1+ you can use :contains to target the text of the h4 tag, then use the adjacent sibling combinator (+) to move to the p that immediately follows it. The following syntax builds the appropriate selector from a list of header names to search for:

from bs4 import BeautifulSoup as bs

# Each h4 title is immediately followed by the p holding its content
html = '''
<h4>Pedigree</h4>
<p>This is where the content is for pedigree that I am trying to scrape</p>
<h4>Breed</h4>
<p>This is where the content is for breed that I am trying to scrape</p>
<h4>Origin</h4>
<p>This is where the content is for origin that I am trying to scrape</p>'''

soup = bs(html, 'lxml')  # requires the lxml package
headers = ['Pedigree', 'Breed']
# One 'h4:contains("...") + p' selector per header, joined into a selector list
selector = ', '.join([f'h4:contains("{header}") + p' for header in headers])
print([i.text for i in soup.select(selector)])
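Since the question also asks about find, here is a minimal sketch of the same idea using find_all and find_next_sibling. It assumes every h4 title is followed by a p with its content; the helper name paragraphs_by_header and the variable names are hypothetical, and html.parser is used only to avoid the lxml dependency. Note that find_next_sibling("p") returns the next p sibling even if it is not directly adjacent, so it is looser than the + combinator. The last lines show the same selector written with :-soup-contains, which is the spelling newer Soup Sieve releases prefer over :contains.

```python
from bs4 import BeautifulSoup

# Assumed markup: each <h4> title is followed by the <p> holding its content.
html = """
<h4>Pedigree</h4>
<p>This is where the content is for pedigree that I am trying to scrape</p>
<h4>Breed</h4>
<p>This is where the content is for breed that I am trying to scrape</p>
<h4>Origin</h4>
<p>This is where the content is for origin that I am trying to scrape</p>
"""

soup = BeautifulSoup(html, "html.parser")


def paragraphs_by_header(soup, tag="h4"):
    """Map each header's text to the text of the next sibling <p> (hypothetical helper)."""
    result = {}
    for header in soup.find_all(tag):
        paragraph = header.find_next_sibling("p")
        if paragraph is not None:  # skip headers with no following paragraph
            result[header.get_text(strip=True)] = paragraph.get_text(strip=True)
    return result


sections = paragraphs_by_header(soup)
print(sections["Breed"])

# Same selector idea as above, using the newer :-soup-contains spelling.
wanted = ["Pedigree", "Breed"]
selector = ", ".join(f'h4:-soup-contains("{name}") + p' for name in wanted)
texts = [p.get_text(strip=True) for p in soup.select(selector)]
print(texts)
```

The dictionary form makes it easy to loop over the header and paragraph pairs when writing rows to a database, which is what option (1) in the question was aiming for.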
