Reputation: 27
I am having trouble clicking a drop-down button so that I can then select additional options to change the web page. I am using Selenium in Python to extract this data. URL is https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019
Code so far:
driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')
#click out of iframe pop-up window
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()
driver.find_element_by_id("choosen-country").click()
I keep getting: NoSuchElementException: Message: no such element: Unable to locate element
In the HTML code, the list of countries does not appear until the drop-down arrow is clicked; however, I cannot get the button to click. Does anyone have any suggestions?
Upvotes: 0
Views: 154
Reputation: 10809
You neglected to mention what information you're actually trying to scrape, so the following alternative solution I'm proposing can only help you so much. If you could elaborate, and let me know what information you're trying to scrape, I can tailor my solution.
Logging one's network traffic (while viewing the page in a browser) reveals that multiple XHR (XmlHttpRequest) HTTP GET requests are made to various REST API endpoints, whose responses are JSON and contain all the information you are likely to want to scrape.
What I'm suggesting is simply to imitate those HTTP GET requests to the necessary REST API endpoints. No Selenium required:
def get_country_id(country_name):
    import requests

    url = "https://www.transfermarkt.com/quickselect/countries"
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return next((country["id"] for country in response.json() if country["name"] == country_name), None)

def get_competitions(country_id):
    import requests

    url = "https://www.transfermarkt.com/quickselect/competitions/{}".format(country_id)
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()

def main():
    country_name = "Iceland"
    country_id = get_country_id(country_name)
    assert country_id is not None

    print("Competitions in {}:".format(country_name))
    for competition in get_competitions(country_id):
        print(competition["name"])

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Competitions in Iceland:
Pepsi Max deild
Lengjudeild
Mjólkurbikarinn
Lengjubikarinn
EDIT - The table data you're trying to scrape unfortunately does not originate from an API. It's baked directly into the HTML of the page. Still, you don't need to use Selenium for this - BeautifulSoup is good enough:
def get_entries():
    import requests
    from bs4 import BeautifulSoup as Soup
    from operator import attrgetter

    url = "https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/"
    params = {
        "saison_id": "2019"
    }
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")
    table = soup.find("table", {"class": "items"})
    assert table is not None

    # Get text from header cells whose class does not contain the substring "hide"
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th:not([class*=\"hide\"])")))
    yield fieldnames

    for row in table.select("tbody > tr"):
        # Assuming the first column will always be an img
        columns = list(map(attrgetter("text"), row.select("td:not([class*=\"hide\"])")[1:]))
        yield dict(zip(fieldnames, columns))

def main():
    from csv import DictWriter

    entries = get_entries()
    fieldnames = next(entries)

    with open("output.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
CSV Output:
club,Squad,Total MV,ø MV
Man City,34,€1.27bn,€37.46m
Liverpool,56,€1.09bn,€19.53m
Spurs,36,€1.04bn,€28.94m
Chelsea,36,€797.00m,€22.14m
Man Utd,43,€775.20m,€18.03m
Arsenal,38,€680.55m,€17.91m
Everton,35,€525.50m,€15.01m
Leicester,32,€384.75m,€12.02m
West Ham,38,€371.75m,€9.78m
Wolves,44,€315.40m,€7.17m
Newcastle,41,€312.58m,€7.62m
Bournemouth,39,€311.20m,€7.98m
Watford,43,€270.65m,€6.29m
Southampton,36,€259.80m,€7.22m
Crystal Palace,33,€248.65m,€7.53m
Brighton,45,€225.83m,€5.02m
Burnley,35,€205.58m,€5.87m
Aston Villa,38,€184.60m,€4.86m
Norwich,38,€110.85m,€2.92m
Sheff Utd,34,€110.80m,€3.26m
The real solution would probably involve combining requests to the REST APIs with scraping table data via BeautifulSoup: you would iterate over every country, every competition in that country, and every year. The updated code I've posted assumes we're only interested in the competition with ID GB1 (which is in England), and only for 2019.
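That nested iteration might be sketched like the snippet below. To be clear about assumptions: the competition dicts are the ones returned by get_competitions above, and I'm guessing that the slug segment of the URL (premierleague in the original URL) is interchangeable, with the competition ID after /wettbewerb/ being what actually selects the page. Verify that against the site before relying on it:

```python
def build_season_url(competition_id, season_id):
    # Assumed URL pattern: /<slug>/startseite/wettbewerb/<ID>/plus/?saison_id=<year>
    # The slug here is a placeholder; the competition ID appears to drive the page.
    return ("https://www.transfermarkt.com/wettbewerb/startseite/wettbewerb/"
            "{}/plus/?saison_id={}".format(competition_id, season_id))

def iter_season_urls(competitions, seasons):
    # competitions: dicts shaped like the quickselect response, e.g. {"id": "GB1", ...}
    for competition in competitions:
        for season_id in seasons:
            yield competition["id"], season_id, build_season_url(competition["id"], season_id)

if __name__ == "__main__":
    for comp_id, season, url in iter_season_urls([{"id": "GB1"}], [2019]):
        print(comp_id, season, url)
```

Each yielded URL would then be fed to something like get_entries, with the hard-coded url/params replaced by the generated one.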
EDIT - You'll have to tweak my solution a bit: I filter out and retain only those columns whose class does not contain the substring "hide", but it turns out some of them are important (like the age column, for example).
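For what it's worth, dropping the :not([class*="hide"]) filter from the selectors is enough to bring those columns back. Here's a minimal, self-contained illustration of the difference; the table markup is invented for the demo (the class names merely mimic the real page's responsive "hide" classes):

```python
from bs4 import BeautifulSoup

# Invented miniature of the page's table markup, just to show the selector difference.
html = """
<table class="items">
  <thead><tr>
    <th>club</th><th class="zentriert hide-for-small">Squad</th><th>Total MV</th>
  </tr></thead>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "items"})

# With the filter: any header whose class attribute contains "hide" is skipped.
visible = [th.text for th in table.select("thead > tr > th:not([class*='hide'])")]

# Without the filter: every header survives, hidden or not.
everything = [th.text for th in table.select("thead > tr > th")]

print(visible)
print(everything)
```

The same change applies to the td selector in get_entries.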
Upvotes: 1
Reputation: 3717
There are two problems here:

1. After clicking the accept button, you need to call driver.switch_to.default_content() to switch back out of the iframe.
2. The country drop-down lives inside a shadow root. The only way I know to identify such an element is kind of hacky: it involves executing JavaScript to get the shadow root, then finding the element in the shadow root.

If I use this code, it works to click that element:
driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')
#click out of iframe pop-up window
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()
driver.switch_to.default_content()
shadow_section = driver.execute_script('''return document.querySelector("tm-quick-select-bar").shadowRoot''')
shadow_section.find_element_by_id("choosen-country").click()
Upvotes: 3