Reputation: 50
I am currently trying to get only the HTML text (a list of names) that is between the first two occurrences of the strong tag.
Here is a short example of the HTML I scraped:
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
Here is some quick code that I wrote with the basic logic of counting the number of strong tags encountered. I know that by the second occurrence, all the names I want have been collected:
import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})

# Return True if a string contains a closing </strong> tag
def findstrong(i):
    return "</strong>" in i

# Count the strong tags seen so far; after the second one I know all the
# names I am interested in have been collected
strong_counts = 0
list_of_names = []
for note in notes.contents:
    if strong_counts >= 2:
        break
    # Convert the node to a string so we can use the findstrong function
    if findstrong(str(note)):
        strong_counts += 1
    else:
        list_of_names.append(note)
The loop works: it collects everything before the first strong tag and everything after it up to the next occurrence of a strong tag, i.e.:
<h3>Title of Article</h3>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
This basically does what I want, but I lose some of the functionality of a BeautifulSoup object since it is now a list. Is there a BeautifulSoup function that can help me do this or another option? Or should I focus on making this loop more efficient before I scale it up to multiple articles?
Upvotes: 1
Views: 1228
Reputation: 84465
This is based on the assumption that marker strings such as PRESENT: are present, so they can be used with :contains. It produces a list of names (the names reside within p elements). Requires bs4 4.7.1+.
from bs4 import BeautifulSoup as bs
html = '''
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p>Other</p>'''
soup = bs(html, 'lxml')
select_html = soup.select('p:contains("PRESENT:") ~ p:not(p:contains("Section Header 2") ~ p, p:contains("Section Header 2"))')
l = [y for x in [i.text.split('\n') for i in select_html] for y in x]
print(l)
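As a possible alternative when the `:contains` selector is not available (newer Soup Sieve releases deprecate it in favour of `:-soup-contains`), the same idea can be sketched with plain sibling navigation. This is only a sketch under assumptions: the helper name `names_after_marker` is hypothetical, the HTML is a shortened copy of the sample, and it assumes the names sit in `p` siblings after the `PRESENT:` paragraph and that the next section starts with a `p` containing a `strong` tag.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question.
html = """
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>George Jungle, Savage</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p>Other</p>
"""


def names_after_marker(soup, marker="PRESENT:"):
    """Hypothetical helper: collect text lines from <p> siblings after the
    marker paragraph, stopping at the first <p> that contains <strong>."""
    start = soup.find("p", string=marker)
    if start is None:
        return []
    names = []
    for p in start.find_next_siblings("p"):
        if p.find("strong") is not None:
            break  # Reached the next section header.
        # stripped_strings yields each text fragment without the
        # whitespace left around the <br/> line breaks.
        names.extend(p.stripped_strings)
    return names


soup = BeautifulSoup(html, "html.parser")
names = names_after_marker(soup)
print(names)
```

Because the loop works on Tag objects until the very last step, you keep the full BeautifulSoup API for each paragraph and only convert to strings at the end.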
Upvotes: 2
Reputation: 9430
To answer the question as asked, while leaving the opportunity to scrape the "Title of Article" and the footnotes, you can use findChildren() and then decompose() to remove unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text "PRESENT" and "Section Header" are not present, and it can easily be adapted to remove elements before the first strong tag if needed.
from bs4 import BeautifulSoup, element
html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)
Outputs:
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
</div>
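The adaptation mentioned above (also dropping what comes before the first strong tag) might look roughly like the sketch below. It is not the code from this answer: it walks only the direct children of the div, it removes the header paragraphs themselves as well, and it assumes every strong tag sits inside a direct child of the div. The HTML is a shortened copy of the sample.

```python
from bs4 import BeautifulSoup

html = """<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
notes = soup.find("div", attrs={"id": "article"})

strong_seen = 0
# find_all() returns a list snapshot, so removing elements while looping
# does not disturb the iteration.
for child in notes.find_all(True, recursive=False):
    has_strong = child.find("strong") is not None
    if has_strong:
        strong_seen += 1
    # Keep only elements strictly between the first and second header.
    if strong_seen != 1 or has_strong:
        child.decompose()

print(notes)
```

The result is still a Tag, so `notes.find_all('p')` and the other BeautifulSoup methods remain available on what is left.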
Upvotes: 1
Reputation: 1734
Based on the title, "Trying to get only the text between two strong tags": if this is truly what is wanted, you can use something like what is found below. We use the CSS level 4 :has() pseudo-class to test that an element contains certain elements, and the CSS level 4 :nth-child(x of s) form to target a specific instance of a compound selector (in our case, the 1st and 2nd p tag containing a strong tag).
from bs4 import BeautifulSoup
html = '''
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:nth-child(1 of :has(strong)) ~ *:has(~ p:nth-child(2 of :has(strong)))'))
Output:
[<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>, <p>PRESENT:</p>, <p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]
If we really want just the list of names, though, we would change the selector to start collecting elements after the paragraph that contains PRESENT:
soup.select('p:contains("PRESENT:") ~ *:has(~ p:nth-child(2 of :has(strong)))')
Output:
[<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]
At that point you can just extract the content you want.
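One way that last extraction step could go is sketched below. The `soup.find_all('p')` call is only a stand-in for the list returned by the `select()` call, the helper name `split_entry` is hypothetical, and reading each entry as "name, role" split on the first comma is an assumption about the data.

```python
from bs4 import BeautifulSoup

# Stand-in for the elements returned by the selector above.
html = """<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>George Jungle, Savage</p>"""

soup = BeautifulSoup(html, "html.parser")
selected = soup.find_all("p")


def split_entry(entry):
    """Hypothetical helper: split 'Name, Role' on the first comma."""
    name, _, role = entry.partition(",")
    return name.strip(), role.strip()


# Flatten every text fragment of every selected paragraph, then split it.
entries = [text for p in selected for text in p.stripped_strings]
people = [split_entry(entry) for entry in entries]
print(people)
```

An entry without a comma would come back with an empty role, so real data may need extra checks.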
Upvotes: 1