ArcTeryxOverAlteryx

Reputation: 50

Trying to get only the text between two strong tags

I am currently trying to get only the HTML text (a list of names) that is between the first two occurrences of the strong tag.

Here is a short example of the HTML I scraped:

<h3>Title of Article</h3>

<p><strong>Section Header 1</strong></p>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....


Here is some quick code that I wrote. The basic logic is to count the strong tags as they occur: once the second one appears, I know all the names that I want have been collected.

import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')

# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})


# Return True if a string contains a closing </strong> tag
def findstrong(s):
    return "</strong>" in s


# Count the strong tags seen so far; after the second one, all the
# names I am interested in have been collected
strong_counts = 0

list_of_names = []
for note in notes.contents:
    if strong_counts >= 2:
        break
    # Convert the node to a string so we can use the findstrong function
    if findstrong(str(note)):
        strong_counts += 1
    else:
        list_of_names.append(note)

The loop works: it collects everything before the first strong tag and everything after it up to the next occurrence of the strong tag, i.e.

<h3>Title of Article</h3>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

This basically does what I want, but I lose some of the functionality of a BeautifulSoup object since it is now a list. Is there a BeautifulSoup function that can help me do this or another option? Or should I focus on making this loop more efficient before I scale it up to multiple articles?

Upvotes: 1

Views: 1228

Answers (3)

QHarr

Reputation: 84465

This relies on marker strings such as PRESENT: being in the HTML, so that they can be matched with :contains. It produces a list of names (the names reside within p elements). Requires bs4 4.7.1+.

from bs4 import BeautifulSoup as bs

html = '''
<h3>Title of Article</h3>    
<p><strong>Section Header 1</strong></p>    
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>   
<p>PRESENT:</p>   
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p>Other</p>'''

soup = bs(html, 'lxml')
select_html = soup.select('p:contains("PRESENT:") ~ p:not(p:contains("Section Header 2") ~ p, p:contains("Section Header 2"))')
l = [y for x in [i.text.split('\n') for i in select_html] for y in x]
print(l)
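
As a hedged variant of the above: newer Soup Sieve releases (2.1 and later, as far as I know) rename :contains() to :-soup-contains() and warn about the old spelling. The sketch below assumes such a version is installed, uses a shortened copy of the sample HTML wrapped in a div, and swaps the split on '\n' for stripped_strings so that no empty strings end up in the list.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML, wrapped in a div for illustration.
html = """
<div id="article">
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph</p>
<p>Other</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selector logic as above, with :-soup-contains() in place of :contains().
selector = (
    'p:-soup-contains("PRESENT:") ~ '
    'p:not(p:-soup-contains("Section Header 2") ~ p, '
    'p:-soup-contains("Section Header 2"))'
)
paragraphs = soup.select(selector)

# stripped_strings yields each text fragment with surrounding whitespace removed.
names = [text for p in paragraphs for text in p.stripped_strings]
print(names)
# ['John Smith, Farmer', 'William Dud, Bum', 'Roy Williams, Coach']
```

The selected paragraphs are still Tag objects, so the usual BeautifulSoup methods remain available on them until the text is flattened.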


Upvotes: 2

Dan-Dev

Reputation: 9430

This answers the question as asked and leaves you the opportunity to scrape the "Title of Article" and the footnotes. You can use findChildren() and then decompose() to remove the unwanted elements. From the output of this code you can extract the data you need quite easily. It works even if the text "PRESENT" and "Section Header" are not present, and it can easily be adapted to remove the elements before the first strong tag if needed.

from bs4 import BeautifulSoup, element

html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)

Outputs:

<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>


</div>
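
One possible way to pull the names out of the pruned markup is sketched below. It starts from an abbreviated copy of the output above and, unlike the pruning step itself, assumes there is a paragraph whose text is exactly "PRESENT:". The variable names marker, names and title are only illustrative.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the pruned output shown above.
pruned_html = """
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes</p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>George Jungle, Savage</p>
</div>
"""

notes = BeautifulSoup(pruned_html, "html.parser").find("div", id="article")

# Assumption: the names start right after a paragraph containing only "PRESENT:".
marker = notes.find("p", string="PRESENT:")

names = []
if marker is not None:
    # Every later sibling paragraph holds one or more names separated by <br/>.
    for sibling in marker.find_next_siblings("p"):
        names.extend(sibling.stripped_strings)

# The rest of the article is still a normal Tag, so other parts stay reachable.
title = notes.find("h3").get_text(strip=True)

print(title)  # Title of Article
print(names)  # ['John Smith, Farmer', 'William Dud, Bum', 'George Jungle, Savage']
```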

Upvotes: 1

facelessuser

Reputation: 1734

Based on the title, "Trying to get only the text between two strong tags", if this is truly what is wanted, you can use something like what is found below. We use the CSS level 4 :has() pseudo-class to test that an element contains certain elements, and the CSS level 4 :nth-child(x of s) pseudo-class to target a certain instance of a compound selector (in our case, the 1st and 2nd p tags that contain a strong tag).

from bs4 import BeautifulSoup

html = '''
<h3>Title of Article</h3>

<p><strong>Section Header 1</strong></p>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:nth-child(1 of :has(strong)) ~ *:has(~ p:nth-child(2 of :has(strong)))'))

Output:

[<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>, <p>PRESENT:</p>, <p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]

If we really want just the list of names though, we'd change the selector to start collecting elements after the paragraph that contains "PRESENT:":

soup.select('p:contains("PRESENT:") ~ *:has(~ p:nth-child(2 of :has(strong)))')

Output:

[<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]

At that point you can just extract the content you want.
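
As a rough sketch of that last extraction step, the snippet below splits each selected line into a (name, role) pair. It assumes every line has the form "name, role", uses a shortened copy of the sample HTML wrapped in a div, and spells the text match as :-soup-contains(), which is the newer name for :contains() in recent Soup Sieve versions. The people list is only an illustrative name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML, wrapped in a div for illustration.
html = """
<div id="article">
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>William, Baller</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Elements after the "PRESENT:" paragraph and before the 2nd p holding a strong tag.
elements = soup.select(
    'p:-soup-contains("PRESENT:") ~ *:has(~ p:nth-child(2 of :has(strong)))'
)

people = []
for el in elements:
    for line in el.stripped_strings:
        # Assumption: the first comma separates the name from the role.
        name, _, role = line.partition(",")
        people.append((name.strip(), role.strip()))

print(people)
# [('John Smith', 'Farmer'), ('William Dud', 'Bum'), ('William', 'Baller')]
```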

Upvotes: 1
