ben_11

Reputation: 43

How do I use BeautifulSoup to access the entire HTML?

I'm very new to web scraping. I'm trying to scrape the World Football Elo Ratings webpage (https://www.eloratings.net/) for a data science project, but I'm only getting the "top level" HTML and none of the nested elements, as shown below:

<!DOCTYPE html>

<html lang="en"><head><title>World Football Elo Ratings</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Ratings for national football teams based on the Elo rating system." name="description"/>
<meta content="football, ratings, Elo, rankings, national, international, soccer, teams" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="scripts/slick.grid.css" rel="stylesheet" type="text/css"/>
<link href="scripts/dygraph.css" rel="stylesheet" type="text/css"/>
<script src="scripts/dygraph.js" type="text/javascript"></script>
<script src="scripts/jquery.js" type="text/javascript"></script>
<script src="scripts/slick.core.js" type="text/javascript"></script>
<script src="scripts/slick.grid.js" type="text/javascript"></script>
<script src="scripts/cldr.js" type="text/javascript"></script>
<script src="scripts/event.js" type="text/javascript"></script>
<script src="scripts/supplemental.js" type="text/javascript"></script>
<script src="scripts/globalize.js" type="text/javascript"></script>
<script src="scripts/number.js" type="text/javascript"></script>
<script src="scripts/date.js" type="text/javascript"></script>
<script src="scripts/ratings.js" type="text/javascript"></script>
<link href="scripts/css.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="main" id="main">
<h1 class="mainheader" id="mainheader"></h1>
<div class="topnav" id="topnav"></div>
<h3 class="subheader" id="subheader"></h3>
<div class="maindiv" id="maindiv"></div>
</div>
<div class="mainmenu" id="mainmenu"></div>
<div class="mainloader">
<div class="loadheader" id="loadheader">World Football Elo Ratings</div>
</div>
</body>
</html>

And here is my code so far:

import requests
from bs4 import BeautifulSoup
import pprint

response = requests.get('https://www.eloratings.net/')

soupObject = BeautifulSoup(response.text, 'html.parser')

pprint.pprint(soupObject)

My initial thought is that JavaScript is being used to generate the majority of the HTML, but I am unsure if this is the case, or how to solve it if it is.

Any advice would be greatly appreciated.

Upvotes: 0

Views: 134

Answers (3)

ben_11

Reputation: 43

****** My Final Solution ******

Thank you to everyone who helped me resolve this. Below is the implementation I am using:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.eloratings.net/')

teamData = driver.find_elements(By.CLASS_NAME, 'ui-widget-content')

From this, if for example you do:

print(teamData[0].text)

Output will be (at the time of writing):

1
Brazil
2150
4
1999
0
+1
1030
364
331
335
657
162
211
2237
914
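
Each element's text comes back as a single newline-separated string, as the output above shows. A minimal sketch for turning that string into a list of fields might look like the following. The helper name split_row is hypothetical, and the meaning of each column is not assumed here; the values stay as strings.

```python
def split_row(row_text: str) -> list[str]:
    """Split one row's newline-separated text into a list of field strings."""
    return [field.strip() for field in row_text.splitlines()]


# First few lines of the sample output shown above.
sample = "1\nBrazil\n2150\n4\n1999"
fields = split_row(sample)
print(fields)
```

With the Selenium code above, something like [split_row(element.text) for element in teamData] should give one list per row.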

Upvotes: 0

Smorgashboard

Reputation: 11

I am relatively new to Stack Overflow, and in fact yours is the first question I am going to try to offer any advice on!

I am not too sure what you are looking to do, i.e., are you trying to get each country and its stats? Or are you simply looking for the order of the rankings?

I have in the past done something similar using Selenium.

I loaded up the webpage you are looking to scrape and tried to figure out how I would do it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Run Firefox without a visible window.
firefox_options = Options()
firefox_options.add_argument("-headless")

driver = webdriver.Firefox(options=firefox_options)
driver.get("https://www.eloratings.net/")

# Wait up to 10 seconds for JavaScript to render the first table row.
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.ui-widget-content")))

stats = []

# Collect the row at each child position; 240 is an upper bound on the row count.
for num in range(1, 240):
    div_name = f"div.ui-widget-content:nth-child({num})"
    # find_elements returns a list, so extend keeps stats a flat list of elements.
    stats.extend(driver.find_elements(By.CSS_SELECTOR, div_name))

# Print the text of each row rather than the raw WebElement objects.
print([stat.text for stat in stats])

This little bit of code runs Firefox in headless mode (no GUI) and collects all the div elements that match the CSS selector. The rows didn't appear to share a single unique CSS selector, but their selectors follow a pattern in which only the number in the parentheses changes, so a simple for loop can collect them all. From here, if you wanted to get each link, for instance, you would do something like:

# The rows are div elements, so read href from the anchor tags inside each row.
links = [a.get_attribute("href")
         for stat in stats
         for a in stat.find_elements(By.TAG_NAME, "a")]

Then you could iterate through those links and follow them to each team's page.

Upvotes: 1

0stone0

Reputation: 43962

You are right: the table is generated by JavaScript, so bs4 won't be able to find it in the HTML that requests returns.


If you look at the network tab, you'll see a request to this url:

https://www.eloratings.net/World.tsv?_=1670338063316

This returns a World.tsv file, which contains the table data.

This can be parsed using Python's csv module:
How to parse tsv file with python?
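
As a rough sketch of that approach, the snippet below reads tab-separated text with csv.reader. The function names fetch_text and parse_tsv are hypothetical, urllib.request is used only to keep the sketch dependency-free (requests.get(url).text should work just as well), and the column layout of World.tsv is not assumed, so rows are returned as plain lists of strings. Dropping the "_=" query parameter assumes it is only a cache-busting timestamp.

```python
import csv
import io
import urllib.request

# URL from the network tab, without the "_=" parameter (assumed to be a
# cache-busting timestamp that the server does not require).
TSV_URL = "https://www.eloratings.net/World.tsv"


def fetch_text(url: str = TSV_URL) -> str:
    """Download the file and decode it (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_tsv(text: str) -> list[list[str]]:
    """Parse tab-separated text into a list of rows, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row]


# Placeholder text only, to show the shape of the result (not real data).
demo_rows = parse_tsv("a\tb\tc\n1\t2\t3\n")
print(demo_rows)
```

To read the real file, call parse_tsv(fetch_text()) and inspect the first few rows to work out which column holds which value.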

Upvotes: 3
