Ethan

Reputation: 31

Requests and PhantomJS not returning full html code

I am trying to scrape several hockey websites for team names and scores. I use requests and PhantomJS to fetch the HTML and Beautiful Soup to parse it. However, for the website below, neither requests nor PhantomJS returns the full HTML; the parts I need are missing.

AHL website: https://theahl.com/stats/daily-schedule/2021-2-7?league=4&season=68&division=-1

In the browser's inspector, the team name is under <div class="ht-team-name" ...> and the score is under <span class="ht-period-value ht-total">. However, when I run either of the two code examples below, these elements (plus many more lines) are missing from the HTML I get back. I am not sure why this is happening; any solutions would be awesome!

Trying with requests (doesn't work):

from bs4 import BeautifulSoup
import requests

url = "https://theahl.com/stats/daily-schedule/2021-2-7?league=4&season=68&division=-1"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
html = requests.get(url,headers=headers).content
soup = BeautifulSoup(html, 'html.parser')

team_name = soup.find_all('div',{'class':'ht-team-name'})
team_score = soup.find_all('span',{'class':'ht-period-value ht-total'})

# Expected a list of team names, but this prints an empty list
print(team_name)
# Expected a list of scores, but this prints an empty list
print(team_score)

Or trying with PhantomJS (this worked for some other websites where requests did not, but it doesn't work here either):

from bs4 import BeautifulSoup 
from selenium import webdriver

url = "https://theahl.com/stats/daily-schedule/2021-2-7?league=4&season=68&division=-1"
browser = webdriver.PhantomJS('phantomjs/phantomjs.exe')
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

team_name = soup.find_all('div',{'class':'ht-team-name'})
team_score = soup.find_all('span',{'class':'ht-period-value ht-total'})

# Expected a list of team names, but this prints an empty list
print(team_name)
# Expected a list of scores, but this prints an empty list
print(team_score)

As a side note, if I print the HTML itself instead of team_name and team_score, the two classes are still not in it, so I don't think the problem is how I am parsing the HTML, but it could be lol!

Upvotes: 0

Views: 80

Answers (1)

user5386938

requests won't help, since the content is loaded via JavaScript and requests only fetches the initial HTML without running any scripts. With a real browser, perhaps you just need to wait a bit for the content to load.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://theahl.com/stats/daily-schedule/2021-2-7?league=4&season=68&division=-1"
browser = webdriver.Chrome()
browser.get(url)

# Crude fixed pause so the page's JavaScript has time to render the scores
time.sleep(10)

# Look up the rendered elements by CSS selector
print(browser.find_elements(By.CSS_SELECTOR, 'div.ht-team-name'))
print(browser.find_elements(By.CSS_SELECTOR, 'span.ht-total.ht-period-value'))

browser.quit()

The above code outputs a few matches for both selectors.

I used time.sleep(...) only because I was too lazy to set up explicit waits.
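
For anyone who would rather not pause for a fixed ten seconds, here is a minimal sketch of the same lookup using Selenium's explicit waits (WebDriverWait together with expected_conditions). The function names element_texts and scrape_daily_schedule and the 15-second timeout are illustrative choices of mine, not anything the site or Selenium prescribes. I am also assuming that the scores are rendered at roughly the same time as the team names, which I have not verified.

```python
def element_texts(elements):
    """Return the stripped, non-empty .text of each element-like object."""
    return [e.text.strip() for e in elements if e.text and e.text.strip()]


def scrape_daily_schedule(url, timeout=15):
    """Open url in Chrome, wait for the team names to render, and return
    (team_names, scores) as two lists of strings.

    The function name and the default timeout are hypothetical.
    """
    # Imported here so element_texts stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Poll until at least one team-name element is present in the DOM;
        # raises TimeoutException if none shows up within `timeout` seconds.
        names = WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "div.ht-team-name")
            )
        )
        # Assumption: the scores are in the DOM once the names are.
        scores = browser.find_elements(
            By.CSS_SELECTOR, "span.ht-period-value.ht-total"
        )
        return element_texts(names), element_texts(scores)
    finally:
        # Close the browser even if the wait times out.
        browser.quit()
```

Calling scrape_daily_schedule(url) with the URL from the question should return as soon as the team names appear instead of always sleeping for the full delay. If the scores turn out to load later than the names, a second WebDriverWait on the score selector would be the obvious adjustment.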

BTW, I do not have PhantomJS, hence Chrome.

Upvotes: 2
