Sully Marshall

Reputation: 37

Omitting specific text using BeautifulSoup

Using BeautifulSoup, I'm trying to extract one specific piece of text from a web page with a custom lambda function. I'm struggling to select exactly what I need while leaving out what I don't.

<div class="article__content">
      
            <h3 class="article__headline">
                <span class="article__label barrons">
Barron&#x27;s                    </span>
                
                    <a class="link" href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
                        
                        
                        More Bad Times Ahead for These 6 Big Tech Stocks
                    </a>
            </h3>
        

        
        <div class="article__details">
            <span class="article__timestamp" data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>

                
            
        </div>
    </div>

</div>

I want to extract just the news headline, in this case "More Bad Times Ahead for These 6 Big Tech Stocks", and leave out the label "Barron's".

So far my function looks like:

for txt in soup.find_all(lambda tag: tag.name == 'h3' and tag.get('class') == ['article__headline']):
    print(txt.text)

I've also tried tag.name == 'a' and tag.get('class') == ['link'], but that returns a lot of other links from the page that I don't need.

Upvotes: 1

Views: 30

Answers (1)

Andrej Kesely

Reputation: 195438

Try the CSS selector "h3 a", which selects every <a> tag nested inside an <h3> tag:

for title in soup.select("h3 a"):
    print(title.text.strip())

Prints:

More Bad Times Ahead for These 6 Big Tech Stocks

If you want to be more specific:

for title in soup.select("h3.article__headline a"):
    print(title.text.strip())
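
For anyone who wants to try this without fetching the page, here is a minimal self-contained sketch. It assumes a trimmed copy of the markup from the question (the whitespace is condensed, and the variable names html, headlines and whole_h3 are only illustrative) and uses the built-in "html.parser". It also prints the text of the whole <h3> for comparison, which shows why selecting the <h3> itself pulls in the "Barron's" label:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: whitespace condensed).
html = """
<div class="article__content">
    <h3 class="article__headline">
        <span class="article__label barrons">Barron&#x27;s</span>
        <a class="link" href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
            More Bad Times Ahead for These 6 Big Tech Stocks
        </a>
    </h3>
    <div class="article__details">
        <span class="article__timestamp" data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Selecting the <a> inside the headline skips the sibling <span> label.
headlines = [a.get_text(strip=True) for a in soup.select("h3.article__headline a")]

# For comparison: the text of the whole <h3> includes the label as well.
whole_h3 = soup.select_one("h3.article__headline").get_text(" ", strip=True)

print(headlines)
print(whole_h3)
```

The more specific selector is probably the safer choice on the real page, since a bare "h3 a" would also match links inside any other <h3> elements that happen to be there.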

Upvotes: 1
