Reputation: 43
I'm very new to web scraping. For a data science project, I'm trying to scrape the World Football Elo Ratings webpage (https://www.eloratings.net/), but I'm not getting the nested HTML elements, only the "top level", as shown below:
<!DOCTYPE html>
<html lang="en"><head><title>World Football Elo Ratings</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Ratings for national football teams based on the Elo rating system." name="description"/>
<meta content="football, ratings, Elo, rankings, national, international, soccer, teams" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="scripts/slick.grid.css" rel="stylesheet" type="text/css"/>
<link href="scripts/dygraph.css" rel="stylesheet" type="text/css"/>
<script src="scripts/dygraph.js" type="text/javascript"></script>
<script src="scripts/jquery.js" type="text/javascript"></script>
<script src="scripts/slick.core.js" type="text/javascript"></script>
<script src="scripts/slick.grid.js" type="text/javascript"></script>
<script src="scripts/cldr.js" type="text/javascript"></script>
<script src="scripts/event.js" type="text/javascript"></script>
<script src="scripts/supplemental.js" type="text/javascript"></script>
<script src="scripts/globalize.js" type="text/javascript"></script>
<script src="scripts/number.js" type="text/javascript"></script>
<script src="scripts/date.js" type="text/javascript"></script>
<script src="scripts/ratings.js" type="text/javascript"></script>
<link href="scripts/css.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="main" id="main">
<h1 class="mainheader" id="mainheader"></h1>
<div class="topnav" id="topnav"></div>
<h3 class="subheader" id="subheader"></h3>
<div class="maindiv" id="maindiv"></div>
</div>
<div class="mainmenu" id="mainmenu"></div>
<div class="mainloader">
<div class="loadheader" id="loadheader">World Football Elo Ratings</div>
</div>
</body>
</html>
And here is my code so far:
import requests
from bs4 import BeautifulSoup
import pprint
response = requests.get('https://www.eloratings.net/')
soupObject = BeautifulSoup(response.text, 'html.parser')
pprint.pprint(soupObject)
My initial thought is that JavaScript is being used to generate the majority of the HTML, but I am unsure if this is the case, or how to solve it if it is.
Any advice would be greatly appreciated.
Upvotes: 0
Views: 134
Reputation: 43
****** My Final Solution ******
Thank you to everyone who helped me resolve this. Below is the implementation I am using:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.eloratings.net/')
teamData = driver.find_elements(By.CLASS_NAME, 'ui-widget-content')
From this, if you run, for example:
print(teamData[0].text)
The output will be (at the time of writing):
1
Brazil
2150
4
1999
0
+1
1030
364
331
335
657
162
211
2237
914
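Since each row's text comes back as one newline-separated string, a small helper can turn it into a list of fields. This is only a sketch: `split_row` is a name made up for illustration, and it assumes every field sits on its own line, as in the output above.

```python
def split_row(row_text):
    """Turn the newline-separated text of one row into a list of field strings."""
    # splitlines() handles "\n" and "\r\n"; strip() drops stray spaces per field
    return [field.strip() for field in row_text.splitlines()]
```

With the output shown above, `split_row(teamData[0].text)[1]` would be `'Brazil'`. The values stay strings, so convert the numeric ones with `int()` as needed.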
Upvotes: 0
Reputation: 11
I am relatively new to Stack Overflow, and in fact yours is the first question I am going to try to offer any advice on!
I am not too sure what you are looking to do, i.e. are you trying to get each country and its stats? Or are you simply looking for the order of the rankings?
I have in the past done something similar using Selenium.
I loaded up the webpage you are looking to scrape and tried to figure out how I would do it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time
fireFoxOptions = Options()
fireFoxOptions.add_argument("-headless")  # the .headless attribute was removed in newer Selenium 4 releases
driver = webdriver.Firefox(options=fireFoxOptions)
driver.get("https://www.eloratings.net/")
original_window = driver.current_window_handle
wait = WebDriverWait(driver, 10)
time.sleep(10)
stats = []
for num in range(1, 240):
    # Select one row at a time by its position among its siblings
    div_name = f"div.ui-widget-content:nth-child({num})"
    element = driver.find_elements(By.CSS_SELECTOR, div_name)
    stats.append(element)  # each entry is a (possibly empty) list of WebElements
print(stats)
This little bit of code runs Firefox in headless mode (no GUI) and gets all the div elements that match the CSS selector. Unfortunately there wasn't a common CSS selector shared by all the elements, but they did follow a pattern where only the number in the parentheses changes. So with a simple for loop we can get all of them. From here, if you wanted to get each link, for instance, you would do something like:
for elements in stats:
    for element in elements:
        link = element.get_attribute("href")  # None if the element has no href
Then you could iterate through those links and follow them to each team's page.
Upvotes: 1
Reputation: 43962
You are right: the table is generated by JavaScript, so bs4 won't be able to find it in the HTML that requests downloads.
If you look at the network tab, you'll see a request to this url:
https://www.eloratings.net/World.tsv?_=1670338063316
This returns World.tsv, which contains the table.
This can be parsed using Python's csv module; see:
How to parse tsv file with python?
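As a rough sketch of that approach, something like the following could work. It uses only the standard library (`requests.get(url).text` would do just as well for the download). The names `parse_tsv` and `fetch_rows` are made up for illustration, and two things are assumptions: that the file is UTF-8 encoded, and that the `?_=...` part of the URL is only a cache-busting timestamp that can be left off.

```python
import csv
import io
import urllib.request

# URL seen in the network tab, without the "?_=..." query string
# (assumed to be a cache-busting timestamp only).
WORLD_TSV_URL = "https://www.eloratings.net/World.tsv"


def parse_tsv(text):
    """Split tab-separated text into a list of rows (each a list of strings)."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    return [row for row in reader if row]  # skip blank lines


def fetch_rows(url=WORLD_TSV_URL):
    """Download the TSV file and return its parsed rows."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")  # encoding is an assumption
    return parse_tsv(text)
```

Calling `rows = fetch_rows()` and then `print(rows[0])` should show the first line of the table. The file itself may not say what each column means, so compare a row against the rendered page to work that out. If the server rejects the request, it may need a browser-like User-Agent header.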
Upvotes: 3