Reputation: 47
I am trying to scrape a website that has paragraph content set off by h3 headers. The H3 headers have title content (e.g. make, model, year) and then the following text in a paragraph is what I'm after. Ideally, I'd love to identify the paragraph content using a class but it's all the same class.
Also, I should say, I'm new to this so please forgive stupid questions or inarticulate phrasing.
I've gotten to the point where I can pull all of the text together, separated only by the HTML paragraph tags. My thoughts are: (1) if I could somehow turn the content between the paragraph tags into individual items in a list, I could write a loop over that list to add those items to the database I'm building.
Alternatively, (2) I was wondering if there is some way to pull the sibling paragraph text from paragraphs that follow headers containing specific text (e.g. Ford 'Models'). I know you can do it with defined ids and classes, but can you identify a specific h3 tag based on the text it contains?
I've been watching YouTube and reading forums non-stop for the past couple of days. Any feedback, delivered as bluntly as you wish, would be GREATLY appreciated! I'm using BeautifulSoup but am happy to use whatever is best for the job.
Thanks!
John
Here is the HTML:
<div class="entry-content bb">
<h4>Pedigree</h4>
<p>This is where the content is for pedigree that I am trying to scrape</p>
<h4>Breed</h4>
<p>This is where the content is for breed that I am trying to scrape</p>
<h4>Origin</h4>
<p>This is where the content is for origin that I am trying to scrape</p>
</div>
If I could identify the content by class alone, this is the code I would be using:
pedigree_temp = soup.find(class_='pedigree').text
pedigree_final.append(pedigree_temp)
Upvotes: 2
Views: 1211
Reputation: 84455
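With bs4 4.7.1+ you can use :contains to target an h4 tag by its text, then use the adjacent sibling combinator (+) to move to the p that immediately follows it. Newer Soup Sieve releases deprecate :contains in favour of :-soup-contains, so prefer that spelling if your installed version supports it. The following syntax generates the appropriate selector from a list of header items to search for: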
from bs4 import BeautifulSoup as bs
html = '''<div class="entry-content bb">
<h4>Pedigree</h4>
<p>This is where the content is for pedigree that I am trying to scrape</p>
<h4>Breed</h4>
<p>This is where the content is for breed that I am trying to scrape</p>
<h4>Origin</h4>
<p>This is where the content is for origin that I am trying to scrape</p>'''
soup = bs(html, 'lxml')
headers = ['Pedigree', 'Breed']
selector = ', '.join([f'h4:contains("{header}") + p' for header in headers])
print([i.text for i in soup.select(selector)])
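If you would rather not depend on the CSS :contains extension, a rough alternative is to find each h4 by its exact text and then step to the next p sibling. This is only a sketch: the helper name paragraph_after is made up for illustration, the sample text is shortened placeholder content, and it assumes each header's text matches exactly (no extra whitespace or nested tags). It uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened placeholder markup modelled on the question's structure.
html = '''<div class="entry-content bb">
<h4>Pedigree</h4>
<p>Pedigree text</p>
<h4>Breed</h4>
<p>Breed text</p>
<h4>Origin</h4>
<p>Origin text</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')


def paragraph_after(soup, header_text):
    """Return the text of the first <p> sibling after the <h4> whose text
    equals header_text, or None if the header or paragraph is missing.
    (Hypothetical helper, not part of BeautifulSoup.)"""
    header = soup.find('h4', string=header_text)
    if header is None:
        return None
    paragraph = header.find_next_sibling('p')
    return paragraph.get_text(strip=True) if paragraph else None


wanted = ['Pedigree', 'Breed']
# Keep each paragraph keyed by its header so the association is not lost.
result = {name: paragraph_after(soup, name) for name in wanted}
print(result)  # {'Pedigree': 'Pedigree text', 'Breed': 'Breed text'}
```

One caveat: find_next_sibling('p') skips over any non-p siblings, so a header with no paragraph of its own would pick up the next section's paragraph.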
Upvotes: 0
Reputation: 550
If you want a list of text in all the paragraph tags from your soup, try:
[tag.text for tag in soup.select('p')]
Or to get a list of tags containing specific text:
import re
# string= is the current name for the older text= argument (bs4 4.4.0+)
for elem in soup(string=re.compile('find me!')):
    print(elem.parent)
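For the "list I can loop over" idea in the question, a plain list of paragraph text loses track of which header each paragraph belonged to. A minimal sketch that keeps them paired is below. It assumes the page follows the h4-then-p pattern from the question inside div.entry-content; the variable name rows and the placeholder text are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened placeholder markup modelled on the question's structure.
html = '''<div class="entry-content bb">
<h4>Pedigree</h4>
<p>Pedigree text</p>
<h4>Breed</h4>
<p>Breed text</p>
<h4>Origin</h4>
<p>Origin text</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for header in soup.select('div.entry-content h4'):
    # Look for the next sibling that is either a paragraph or another header.
    sibling = header.find_next_sibling(['p', 'h4'])
    # Only pair them up if a <p> comes before the next <h4>.
    if sibling is not None and sibling.name == 'p':
        rows.append((header.get_text(strip=True), sibling.get_text(strip=True)))

# Each row is a (header, paragraph) tuple, ready to loop over.
for title, text in rows:
    print(title, '->', text)
```

From there, each (title, text) pair can be written to whatever database you are building inside the same loop.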
Upvotes: 1