Jin

Reputation: 93

How do you web-scrape past a "show more" button using BeautifulSoup Python?

I am using BeautifulSoup in Python to scrape football statistics from this website: https://www.skysports.com/premier-league-results/2020-21. However, the site only shows the first 200 games of the season; the remaining 180 games are hidden behind a "show more" button. The button does not change the URL, so I can't simply request a different page.

This is my code:

from bs4 import BeautifulSoup
import requests

scores_html_text = requests.get('https://www.skysports.com/premier-league-results/2020-21').text
scores_soup = BeautifulSoup(scores_html_text, 'lxml')

fixtures = scores_soup.find_all('div', class_ = 'fixres__item')

This only gets the first 200 fixtures.

How can I access the HTML behind the "show more" button?

Upvotes: 2

Views: 2252

Answers (3)

Andrej Kesely

Reputation: 195438

The hidden results are inside a <script> tag, so to get all 380 results you need to additionally parse its contents as HTML:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.skysports.com/premier-league-results/2020-21"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

script = soup.select_one('[type="text/show-more"]')
script.replace_with(BeautifulSoup(script.contents[0], "html.parser"))

all_data = []
for item in soup.select(".fixres__item"):
    all_data.append(item.get_text(strip=True, separator="|").split("|")[:5])
    all_data[-1].append(
        item.find_previous(class_="fixres__header2").get_text(strip=True)
    )

df = pd.DataFrame(
    all_data, columns=["Team 1", "Score 1", "Score 2", "Time", "Team 2", "Date"]
)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                       Team 1 Score 1 Score 2   Time                    Team 2                     Date
0                     Arsenal       2       0  16:00  Brighton and Hove Albion          Sunday 23rd May
1                 Aston Villa       2       1  16:00                   Chelsea          Sunday 23rd May
2                      Fulham       0       2  16:00          Newcastle United          Sunday 23rd May
3                Leeds United       3       1  16:00      West Bromwich Albion          Sunday 23rd May

...

377            Crystal Palace       1       0  15:00               Southampton  Saturday 12th September
378                 Liverpool       4       3  17:30              Leeds United  Saturday 12th September
379           West Ham United       0       2  20:00          Newcastle United  Saturday 12th September

and saves data.csv.
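The `replace_with` call above is the crux: before it runs, the hidden fixtures are just text inside the `<script>` tag and invisible to CSS selectors; re-parsing that text and splicing it back in makes them selectable. A self-contained illustration on a made-up fragment (not the real Sky Sports markup):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the page: one visible fixture plus one
# hidden as raw text inside a type="text/show-more" script tag.
html = """
<div class="fixres__item">Visible fixture</div>
<script type="text/show-more">
    <div class="fixres__item">Hidden fixture</div>
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# Before the replacement, the hidden fixture is plain text inside
# the <script> tag and is not matched by a CSS selector.
print(len(soup.select(".fixres__item")))  # 1

# Parse the script's text content as HTML and splice it into the tree.
script = soup.select_one('[type="text/show-more"]')
script.replace_with(BeautifulSoup(script.contents[0], "html.parser"))

print(len(soup.select(".fixres__item")))  # 2
```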

Upvotes: 3

K. Iniyan

Reputation: 15

I tried going up a few levels in the DOM and this worked, though you might need to process the result a bit more.

from bs4 import BeautifulSoup
import requests

scores_html_text = requests.get('https://www.skysports.com/premier-league-results/2020-21').text
scores_soup = BeautifulSoup(scores_html_text,'lxml')

fixtures = scores_soup.find(class_ = 'site-layout-secondary block page-nav__offset grid')
print(fixtures)
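As a sketch of what "processing it a bit more" could look like, here is how the `fixres__item` rows inside that container might be reduced to plain text (the HTML fragment below is made up for illustration; on the real page the fixtures behind the "show more" button would still need the script-tag treatment from the first answer):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the container this answer finds.
html = """
<div class="site-layout-secondary block page-nav__offset grid">
  <div class="fixres__item">Arsenal 2 - 0 Brighton</div>
  <div class="fixres__item">Aston Villa 2 - 1 Chelsea</div>
</div>
"""

# class_ with a space-separated string matches the exact class attribute.
container = BeautifulSoup(html, "html.parser").find(
    class_="site-layout-secondary block page-nav__offset grid"
)

# Strip each fixture down to its text content.
rows = [item.get_text(strip=True) for item in container.find_all(class_="fixres__item")]
print(rows)  # ['Arsenal 2 - 0 Brighton', 'Aston Villa 2 - 1 Chelsea']
```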

Upvotes: 0

Cyber

Reputation: 172

I am not aware of how to do this with BeautifulSoup, but here is how I would do it using Selenium (note that I am very new to Selenium, so there are probably better ways of doing this).

The imports used are:

from selenium import webdriver
import time

You will also need to download the Chrome webdriver (assuming that you are on Chrome), and place it in the same directory as your script, or in your library path.

There will be a cookies popup which you have to work around:

# prepare the driver
URL = "https://www.skysports.com/premier-league-results/2020-21"
driver = webdriver.Chrome()
driver.get(URL)
# wait so that the page has loaded before we look for the cookies popup
time.sleep(2)

# the cookies popup lives inside an iframe, so switch into it first
frame = driver.find_element_by_id('sp_message_iframe_533903')
driver.switch_to.frame(frame)
# find the accept button (inspect element and copy the XPath of the button)
driver.find_element_by_xpath('//*[@id="notice"]/div[3]/button[1]').click()
# switch back to the main document before touching the rest of the page
driver.switch_to.default_content()
time.sleep(2)
driver.refresh()

# find the "show more" button and click it
driver.find_element_by_class_name("plus-more__text").click()
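After the click, the expanded page can be handed back to BeautifulSoup via `driver.page_source`, so the question's original parsing code can be reused. A sketch of that parsing step, with an inline string standing in for the real page source:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the "show more" click;
# on the real page every fixture is a div with class fixres__item.
page_source = """
<div class="fixres__item">Fixture 1</div>
<div class="fixres__item">Fixture 2</div>
<div class="fixres__item">Fixture 3</div>
"""

# html.parser is used here so the snippet runs without lxml installed.
soup = BeautifulSoup(page_source, "html.parser")
fixtures = soup.find_all("div", class_="fixres__item")
print(len(fixtures))  # 3
```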

Upvotes: 0
