user15143876

'NoneType' error when scraping marketwatch.com with bs4

I'm struggling with a NoneType error when trying to scrape this piece of HTML:

<div class="article__content">
        
        
         <h3 class="article__headline">
                    <a class="link" href="https://www.marketwatch.com/story/infrastructure-bill-looks-set-to-pass-senate-without-changes-sought-by-crypto-advocates-2021-08-10?mod=cryptocurrencies">
                        
                        Infrastructure bill looks set to pass Senate without changes sought by crypto advocates
                    </a>
                </h3>
        <p class="article__summary">A $1 trillion bipartisan infrastructure bill on Tuesday appeared on track to pass the Senate without changes sought by the cryptocurrency industry&#x27;s supporters, as a deal among key senators on an amendment didn&#x27;t get suppo...</p>

        <div class="content--secondary">
                <div class="group group--tickers">
                            <bg-quote class="negative" channel="/zigman2/quotes/31322028/realtime">
                                <a class="ticker qt-chip j-qt-chip" data-charting-symbol="CRYPTOCURRENCY/US/COINDESK/BTCUSD" data-track-hover="QuotePeek" href="https://www.marketwatch.com/investing/cryptocurrency/btcusd?mod=cryptocurrencies">
                                    <span class="ticker__symbol">BTCUSD</span>
                                    <bg-quote class="ticker__change" field="percentChange" channel="/zigman2/quotes/31322028/realtime">-1.07%</bg-quote>
                                    <i class="icon"></i>
                                </a>
                            </bg-quote>
                </div>
                
        </div>
        <div class="article__details">
            <span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10, 2021 at 10:42 a.m. ET</span>

                <span class="article__author">by Victor Reklaitis</span>
            
        </div>
    </div>

My code looks like this:

    for article in soup.find_all('div', class_='article__content'):
        date = article.find('span', class_='article__timestamp')['data-est']
        print(date)

Can someone explain what the problem is and why this span couldn't be found?

Upvotes: 0

Views: 109

Answers (1)

Andrej Kesely

Reputation: 195438

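The page contains some <div class="article__content"> tags that have no timestamp span. For those, find() returns None, and indexing None with ['data-est'] raises the error. Skip them: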

import requests
from bs4 import BeautifulSoup


url = "https://www.marketwatch.com/investing/cryptocurrency?mod=side_nav"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for article in soup.find_all("div", class_="article__content"):
    date = article.find("span", class_="article__timestamp")
    # find() returns None when the block has no timestamp span; skip it
    if date is None:
        continue
    print(date["data-est"])

Prints:

2021-08-10T10:42:34
2021-08-10T05:30:00
2021-08-09T19:15:00
2021-08-09T12:33:00
2021-08-09T11:22:00
2021-08-08T20:09:00
2021-08-07T15:14:00
2021-08-07T15:04:00
2021-08-06T09:15:27
2021-08-05T14:25:00
2021-08-05T11:17:00
2021-08-04T16:11:00
2021-08-02T17:07:00
2021-08-02T06:54:00
2021-08-01T21:01:00
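
The failure can also be reproduced without a network request. The following is a minimal sketch, assuming a trimmed-down, made-up HTML sample (SAMPLE_HTML) and a hypothetical helper name (collect_timestamps); neither comes from the real page. It shows that find() returns None for a block without a timestamp and that the guard skips it:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup: the second block has no timestamp span.
SAMPLE_HTML = """
<div class="article__content">
  <span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10, 2021</span>
</div>
<div class="article__content">
  <p class="article__summary">A block without a timestamp</p>
</div>
"""


def collect_timestamps(html):
    """Return the data-est value of every article block that has a timestamp."""
    soup = BeautifulSoup(html, "html.parser")
    timestamps = []
    for article in soup.find_all("div", class_="article__content"):
        span = article.find("span", class_="article__timestamp")
        if span is None:
            # find() found nothing; span["data-est"] would raise TypeError here
            continue
        timestamps.append(span["data-est"])
    return timestamps


timestamps = collect_timestamps(SAMPLE_HTML)
print(timestamps)
```

Without the None check, the second block would raise TypeError: 'NoneType' object is not subscriptable, which is the error from the question.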

Or with a CSS selector, which matches only timestamp spans that carry a data-est attribute:

for span in soup.select(".article__content .article__timestamp[data-est]"):
    print(span["data-est"])
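
To see what the attribute selector filters out, here is a small offline sketch. The markup in SELECTOR_SAMPLE is invented for illustration and is not taken from the real page: one block has a usable timestamp, one has a timestamp span without data-est, and one has no timestamp at all.

```python
from bs4 import BeautifulSoup

# Invented sample markup covering the three cases described above.
SELECTOR_SAMPLE = """
<div class="article__content">
  <span class="article__timestamp" data-est="2021-08-10T10:42:34">Aug. 10</span>
</div>
<div class="article__content">
  <span class="article__timestamp">no data-est attribute</span>
</div>
<div class="article__content">
  <p class="article__summary">no timestamp span</p>
</div>
"""

soup = BeautifulSoup(SELECTOR_SAMPLE, "html.parser")

# [data-est] keeps only spans that actually have the attribute,
# so the subscript below cannot fail on a missing key or on None.
selected = [
    span["data-est"]
    for span in soup.select(".article__content .article__timestamp[data-est]")
]
print(selected)
```

Because select() returns only matching elements, no explicit None check is needed in the loop.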

Upvotes: 2
