I'm scraping some HTML that is formatted like this:
<div class="doccontent">
<h3> Section Title 1 </h3>
<div class="line"> My first line </div>
<div class="line"> My second line </div>
<div class="linenumber"> text i don't need </div>
<h3> Section Title 2 </h3>
<div class="line"> My third line </div>
<div class="chapter">Chapter four</div>
<div class="line"> My fourth line </div>
</div>
I only want to capture the h3 and class="line" text. I tried two ways. The first:
for lines in full_text:
    for booktitle in lines.find("h3"):
        linesArr.append(booktitle)
    for line in lines.find_all(class_='line'):
        linesArr.append(line)
This appends all booktitles to the beginning of the list, then starts working on the lines.
The second:
for lines in full_text:
    for line in lines.find_all(['h3', class_="line"]):
        linesArr.append(line)
The second seems more promising to me, but it raises a syntax error. The BS4 documentation doesn't cover how to search for a mixed list of tags and classes. Any help would be appreciated.
Upvotes: 1
As mentioned in the comments, you can use the CSS "or" syntax (a comma-separated selector list) to specify multiple selectors and pass them to select:
data = [item.text for item in soup.select("h3 , .line")]
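To make that concrete, here is a minimal runnable sketch against the sample HTML from the question. It assumes the two unclosed class="line attributes in the sample are typos. The names html, is_heading_or_line and same are only illustrative. The second half shows what I believe is an equivalent find_all call using a predicate function, in case you would rather not use CSS selectors.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (missing quotes on class="line" assumed to be typos).
html = """
<div class="doccontent">
<h3> Section Title 1 </h3>
<div class="line"> My first line </div>
<div class="line"> My second line </div>
<div class="linenumber"> text i don't need </div>
<h3> Section Title 2 </h3>
<div class="line"> My third line </div>
<div class="chapter">Chapter four</div>
<div class="line"> My fourth line </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h3, .line" matches any element that is an <h3> OR has the class "line".
# Matches come back in document order, so headings stay next to their lines.
# ".line" matches the class token exactly, so "linenumber" is not picked up.
data = [item.get_text(strip=True) for item in soup.select("h3, .line")]


# Alternative: find_all accepts a function that receives each tag
# and returns True for the ones to keep.
def is_heading_or_line(tag):
    return tag.name == "h3" or "line" in tag.get("class", [])


same = [tag.get_text(strip=True) for tag in soup.find_all(is_heading_or_line)]

print(data)
```

get_text(strip=True) is used here instead of .text only to trim the padding spaces inside each tag; drop it if you need the raw text.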
Upvotes: 2