robocon20x

Reputation: 175

How can I get only the text of tags that contain a certain string using BeautifulSoup?

Situation

Given an unordered list in which some list items contain the string "is", I want to get only the text of those items:

<ul class="fun-facts">
    <li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>
    <li>Middle name is Ronald</li>
    <li>Never had been on a plane until college</li>
    <li>Dunkin Donuts coffee is better than Starbucks</li>
    <li>A favorite book series of mine is <i>Ender's Game</i></li>
    <li>Current video game of choice is <i>Rocket League</i></li>
    <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>
</ul>

My approach

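import re

# webpage is assumed to be the parsed BeautifulSoup object for the page
facts = webpage.select('ul.fun-facts li')

# first text node inside each <li> that contains "is" (None if there is none)
facts_with_is = [fact.find(string=re.compile('is')) for fact in facts]

# drop the None values
facts_with_is1 = [fact for fact in facts_with_is if fact]

# go back up to the enclosing tag to get its full text
facts_with_is2 = [fact.find_parent().get_text() for fact in facts_with_is if fact]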

Results

facts:

[<li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>, <li>Middle name is Ronald</li>, <li>Never had been on a plane until college</li>, <li>Dunkin Donuts coffee is better than Starbucks</li>, <li>A favorite book series of mine is <i>Ender's Game</i></li>, <li>Current video game of choice is <i>Rocket League</i></li>, <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>]

facts_with_is1 (facts_with_is after filtering out the None values):

['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks', 'A favorite book series of mine is ', 'Current video game of choice is ', "The band that I've seen the most times live is the "]

facts_with_is2:

['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks', "A favorite book series of mine is Ender's Game", 'Current video game of choice is Rocket League', "The band that I've seen the most times live is the Zac Brown Band"]

How can I get the expected result (facts_with_is2) with a simpler approach?

Upvotes: 0

Views: 51

Answers (1)

HedgeHog

Reputation: 25048

Solution using bs4 only

Select all <li> elements and check in a loop whether the text of each one contains the string ' is ':

from bs4 import BeautifulSoup

html_text = '''<ul class="fun-facts">
    <li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>
    <li>Middle name is Ronald</li>
    <li>Never had been on a plane until college</li>
    <li>Dunkin Donuts coffee is better than Starbucks</li>
    <li>A favorite book series of mine is <i>Ender's Game</i></li>
    <li>Current video game of choice is <i>Rocket League</i></li>
    <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>
</ul>'''

soup = BeautifulSoup(html_text, 'lxml')

# keep the full text of every <li> whose text contains ' is '
[x.get_text() for x in soup.select('ul.fun-facts li') if ' is ' in x.get_text()]

Output

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]
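The check for ' is ' relies on a space on both sides, so it would miss an "is" followed by punctuation or placed at the very end of an item. As a possible variant (a sketch, not part of the approach above), a regular expression with word boundaries can be used for the test instead. The shortened HTML sample, the name word_is and the use of the built-in 'html.parser' are assumptions made for illustration only:

```python
import re
from bs4 import BeautifulSoup

# Shortened, partly made-up sample (assumption: not the original page)
html_text = '''<ul class="fun-facts">
    <li>Middle name is Ronald</li>
    <li>This one has no standalone match</li>
    <li>Current video game of choice is <i>Rocket League</i></li>
    <li>Coffee: that is, the good kind</li>
</ul>'''

# 'html.parser' ships with Python, so no extra parser is needed here
soup = BeautifulSoup(html_text, 'html.parser')

# \b matches at word boundaries, so the "is" inside "This" is ignored,
# while "is," (followed by punctuation) still counts as a match
word_is = re.compile(r'\bis\b')

facts_with_is = [
    li.get_text()
    for li in soup.select('ul.fun-facts li')
    if word_is.search(li.get_text())
]

print(facts_with_is)
```

In both versions the test and the result use get_text() on the whole <li>, which joins the text of nested tags such as <i>. That is why the full sentence comes back, whereas find(string=...) returns only the first matching text node and therefore stops before the nested tag.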

Upvotes: 1
