Robsmith

Reputation: 473

Iterate Over URLs Using BeautifulSoup

I have written some code to gather URLs for each race course from https://www.horseracing.net/racecards. I have also written some code to scrape data from each race course page.

Each piece of code works on its own, but I'm having trouble writing a for loop that runs the scraper over all of the race course URLs.

Here's the code to scrape the course URLs:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

todays_racecard_url = 'https://www.horseracing.net/racecards'
base_url = "https://www.horseracing.net"
reqs = requests.get(todays_racecard_url)
content = reqs.text
soup = BeautifulSoup(content, 'html.parser')
course_urls = []

for h in soup.find_all('h3'):
    a = h.find('a')

    # Headings without a link give None; skip them explicitly
    # rather than swallowing every error with a bare except.
    if a is not None and 'href' in a.attrs:
        card_url = urljoin(base_url, a.get('href'))
        course_urls.append(card_url)

for card_url in course_urls:
    print(card_url)
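For reference, the `urljoin` call above resolves relative hrefs against `base_url` while leaving absolute URLs untouched; a quick check of that behavior (the paths here are illustrative):

```python
from urllib.parse import urljoin

base_url = "https://www.horseracing.net"

# A site-relative href is resolved against the base URL.
print(urljoin(base_url, "/racecards/fontwell/13-05-21"))
# → https://www.horseracing.net/racecards/fontwell/13-05-21

# An already-absolute href is returned unchanged.
print(urljoin(base_url, "https://example.com/other"))
# → https://example.com/other
```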

And here's the code to scrape the pages:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.horseracing.net/racecards/fontwell/13-05-21"

results = requests.get(url)

soup = BeautifulSoup(results.text, "html.parser")

date = []
course = []
time = []
runner = []
tips = []
tipsters = []

runner_div = soup.find_all('div', class_='row-cell-right')

for container in runner_div:

    runner_name = container.h5.a.text
    runner.append(runner_name)

    # look up each span once instead of calling find() twice per field
    tips_span = container.find('span', class_='tip-text number-tip')
    tips.append(tips_span.text if tips_span else '')

    tipster_span = container.find('span', class_='pointers-text currency-text')
    tipsters.append(tipster_span.text if tipster_span else '')

newspaper_tips = pd.DataFrame({
    'Runners': runner,
    'Tips': tips,
    'Tipsters': tipsters,
})

newspaper_tips['Tipsters'] = newspaper_tips['Tipsters'].str.replace(' - ', '')

newspaper_tips.to_csv('NewspaperTips.csv', mode='a', header=False, index=False)

How do I join them to get the result I'm looking for?

Upvotes: 2

Views: 70

Answers (1)

Martin Evans

Reputation: 46789

It could be combined as follows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

todays_racecard_url = 'https://www.horseracing.net/racecards'
base_url = "https://www.horseracing.net"

req = requests.get(todays_racecard_url)
soup_racecard = BeautifulSoup(req.content, 'html.parser')
df = pd.DataFrame(columns=['Runners', 'Tips', 'Tipsters'])

for h in soup_racecard.find_all('h3'):
    a = h.find('a', href=True)    # only find tags with href present
    
    if a:
        url = urljoin(base_url, a['href'])
        print(url)
        results = requests.get(url)
        soup_url = BeautifulSoup(results.text, "html.parser")

        for container in soup_url.find_all('div', class_='row-cell-right'):
            runner_name = container.h5.a.text
            tips_span = container.find('span', class_='tip-text number-tip')
            tips_no = tips_span.text if tips_span else ''
            tipster_span = container.find('span', class_='pointers-text currency-text')
            tipster_names = tipster_span.text if tipster_span else ''
            row = [runner_name, tips_no, tipster_names]
            df.loc[len(df)] = row       # append the new row

df['Tipsters'] = df['Tipsters'].str.replace(' - ', '')
df.to_csv('NewspaperTips.csv', index=False)    
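An optional refinement, not part of the answer above: appending row-by-row with `df.loc[len(df)]` grows the DataFrame one row at a time, which gets slow for many runners. Collecting rows in a plain list and building the DataFrame once scales better. The HTML snippet and the `span_text` helper below are illustrative, not taken from the live site:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative markup mimicking one runner cell on the page.
html = """
<div class="row-cell-right">
  <h5><a href="#">Ajrad</a></h5>
  <span class="tip-text number-tip">2</span>
  <span class="pointers-text currency-text"> - NEWMARKET</span>
</div>
"""

def span_text(container, cls):
    # Return the span's text, or '' when the span is absent,
    # avoiding repeated find() calls per field.
    span = container.find('span', class_=cls)
    return span.text if span else ''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for container in soup.find_all('div', class_='row-cell-right'):
    rows.append([
        container.h5.a.text,
        span_text(container, 'tip-text number-tip'),
        span_text(container, 'pointers-text currency-text'),
    ])

# One construction at the end instead of per-row appends.
df = pd.DataFrame(rows, columns=['Runners', 'Tips', 'Tipsters'])
df['Tipsters'] = df['Tipsters'].str.replace(' - ', '', regex=False)
print(df.iloc[0].tolist())   # ['Ajrad', '2', 'NEWMARKET']
```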

Giving you a CSV starting:

Runners,Tips,Tipsters
Ajrad,2,NEWMARKET
Royal Tribute,1,The Times
Time Interval,1,Daily Mirror
Hemsworth,1,Daily Express
Ancient Times,,
Final Watch,,
Hala Joud,,
May Night,1,The Star
Tell'Em Nowt,,
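Note that unlike the question's `mode='a', header=False` append (which duplicates data across runs and never writes a header), the answer writes the file once with a header row. A minimal check of that output format, using an in-memory buffer in place of a file (the single data row is illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({'Runners': ['Ajrad'], 'Tips': ['2'], 'Tipsters': ['NEWMARKET']})

# Writing without mode='a' produces a fresh file with one header row.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```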

Upvotes: 1
