Subin Sen
Subin Sen

Reputation: 1

Scrape all URLs of a webpage

I have the following url https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801 where the last 6 digits is a unique identifier for a specific runner. I want to find all of the 6 digit unique identifiers on this page.

I've tried to scrape all urls on the page (code shown below), but unfortunately I only get a high-level summary. Rather than an in depth list which should contain >5000 runners. Im hoping to get a list/dataframe which shows:

  1. https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801

  2. https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500000

  3. https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500005

etc.

This is what i've been able to do so far. I appreciate any help!

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://www.gbgb.org.uk//greyhound-profile//")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

Thanks for the help in advance!

Upvotes: 0

Views: 136

Answers (2)

Schopen Hacker
Schopen Hacker

Reputation: 322

you can convert the result content to a pandas dataframe then just use winnerOr2ndName and winnerOr2ndId columns

Example

import json
import requests
import pandas as pd

def get_items(dog_id):
    url = f"https://api.gbgb.org.uk/api/results/dog/{dog_id}?page=-1"
    params = {"page": "-1", "itemsPerPage": "20", "race_type": "race"}
    response = requests.get(url, params=params).json()
    MAX_PAGES = response["meta"]["pageCount"]
    result = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName','winnerOr2ndId']].dropna())
    result["winnerOr2ndId"] = result["winnerOr2ndId"].astype(int)
    
    while int(params.get("page"))<MAX_PAGES:
        params["page"] = str(int(params.get("page")) + 1)
        response = requests.get(url, params=params).json()
        new_items = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName','winnerOr2ndId']].dropna())
        new_items["winnerOr2ndId"] = new_items["winnerOr2ndId"].astype(int)
        result = pd.concat([result, new_items])
    
    return result.drop_duplicates()

It would generate a dataframe looking like this:

enter image description here

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195428

The data is loaded dynamicall from the external API URL. You can use next example how to load the data (with the IDs):

import json
import requests


api_url = "https://api.gbgb.org.uk/api/results/dog/517801"  # <-- 517801 is the ID from your URL in the question
params = {"page": "1", "itemsPerPage": "20", "race_type": "race"}

page = 1
while True:
    params["page"] = page
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    if not data["items"]:
        break

    for i in data["items"]:
        print(
            "{:<30} {}".format(
                i.get("winnerOr2ndName", ""), i.get("winnerOr2ndId", "")
            )
        )

    page += 1

Prints:

Ferndale Boom                  534358
Laganore Mustang               543937
Tickity Kara                   535237
Thor                           511842
Ballyboughlewiss               519556
Beef Cakes                     551323
Distant Millie                 546674
Lissan Kels                    525148
Rosstemple Marko               534276
Happy Harry                    550042
Porthall Ella                  550841
Southlodge Eden                531677
Effernogue Beef                547416
Faydas Truffle                 528780
Johns Lass                     538763
Faydas Truffle                 528780
Toms Hero                      543659
Affane Buzz                    547555
Emkay Flyer                    531456
Ballymac Tilly                 492923
Kilcrea Duke                   542178
Sporting Sultan                541880
Droopys Poet                   542020
Shortwood Elle                 527241
Rosstemple Marko               534276
Erics Bozo                     541863
Swift Launch                   536667
Longsearch                     523017
Swift Launch                   536667
Takemyhand                     535023
Floral Print                   527192
Rustys Aero                    497270
Autumn Dapper                  519528
Droopys Kiwi                   511989
Deep Chest                     520634
Newtack Henry                  525511
Indian Nightmare               524636
Lady Mascara                   528399
Tarsna Yankee                  517373
                               
Leathems Act                   516918
Final Star                     514015
Ascot Faye                     500812
Ballymac Ernie                 503569

Upvotes: 2

Related Questions