Martin598
Martin598

Reputation: 1581

Web scraping with python request and beautiful soup

I am trying to scrape all the articles on this web page: https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/

I can scrape the first article, but need help understanding how to jump to the next article and scrape the information there. Thank you in advance for your support.

import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self,url,title,body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

# Scaping news articles from Coindesk

def scrapeCoindesk(url):
    bs = getPage(url)
    title = bs.find("h3").text
    body = bs.find("p",{'class':'desc'}).text
    return Content(url,title,body)

# Pulling the article from coindesk

url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
print ('Title:{}'.format(content.title))
print ('URl: {}\n'.format(content.url))
print (content.body)

Upvotes: 2

Views: 928

Answers (2)

Ajax1234
Ajax1234

Reputation: 71451

You can use find_all with BeautifulSoup:

from bs4 import BeautifulSoup as soup
from collections import namedtuple
import request, re
article = namedtuple('article', 'title, link, timestamp, author, description')
r = requests.get('https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/').text
full_data = soup(r, 'lxml')
results = [[i.text, i['href']] for i in full_data.find_all('a', {'class':'fade'})]
timestamp = [re.findall('(?<=\n)[a-zA-Z\s]+[\d\s,]+at[\s\d:]+', i.text)[0] for i in full_data.find_all('p', {'class':'timeauthor'})]
authors = [i.text for i in full_data.find_all('a', {'rel':'author'})]
descriptions = [i.text for i in full_data.find_all('p', {'class':'desc'})]
full_articles = [article(*(list(i[0])+list(i[1:]))) for i in zip(results, timestamp, authors, descriptions) if i[0][0] != '\n ']

Output:

[article(title='Topping Out? Bitcoin Bulls Need to Defend $9K', link='https://www.coindesk.com/topping-out-bitcoin-bulls-need-to-defend-9k/', timestamp='May 8, 2018 at 09:10 ', author='Omkar Godbole', description='Bitcoin risks falling to levels below $9,000, courtesy of the bearish setup on the technical charts. '), article(title='Bitcoin Risks Drop Below $9K After 4-Day Low', link='https://www.coindesk.com/bitcoin-risks-drop-below-9k-after-4-day-low/', timestamp='May 7, 2018 at 11:00 ', author='Omkar Godbole', description='Bitcoin is reporting losses today but only a break below $8,650 would signal a bull-to-bear trend change. '), article(title="Futures Launch Weighed on Bitcoin's Price, Say Fed Researchers", link='https://www.coindesk.com/federal-reserve-scholars-blame-bitcoins-price-slump-to-the-futures/', timestamp='May 4, 2018 at 09:00 ', author='Wolfie Zhao', description='Cai Wensheng, a Chinese angel investor, says he bought 10,000 BTC after the price dropped earlier this year.\n'), article(title='Bitcoin Looks for Price Support After Failed $10K Crossover', link='https://www.coindesk.com/bitcoin-looks-for-price-support-after-failed-10k-crossover/', timestamp='May 3, 2018 at 10:00 ', author='Omkar Godbole', description='While equity bulls fear drops in May, it should not be a cause of worry for the bitcoin market, according to historical data.'), article(title='Bitcoin Sets Sights Above $10K After Bull Breakout', link='https://www.coindesk.com/bitcoin-sets-sights-10k-bull-breakout/', timestamp='May 3, 2018 at 03:18 ', author='Wolfie Zhao', description="Goldman Sachs is launching a new operation that will use the firm's own money to trade bitcoin-related contracts on behalf of its clients.")]

Upvotes: 2

agubelu
agubelu

Reputation: 408

You can use the fact that every article is contained within a div.article to iterate over them:

def scrapeCoindesk(url):
    bs = getPage(url)
    articles = []
    for article in bs.find_all("div", {"class": "article"}):
        title = article.find("h3").text
        body = article.find("p", {"class": "desc"}).text
        article_url = article.find("a", {"class": "fade"})["href"]
        articles.append(Content(article_url, title, body))
    return articles


# Pulling the article from coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
for article in content:
    print(article.url)
    print(article.title)
    print(article.body)
    print("-------------")

Upvotes: 4

Related Questions