Reputation: 81
I am trying to extract the text from a class that appears within (and after) an <a> tag, like so:
from bs4 import BeautifulSoup

html = """<div class="wisbb_teamA">
<a href="http://www.example.com/eg1" class="wisbb_name">Phillies</a>
</div>"""

soup = BeautifulSoup(html, "lxml")
for div in soup.findAll('div', attrs={'class': 'wisbb_teamA'}):
    print(div.find('a').contents[0])
This will return the following:
Phillies
This is correct, but when I try to extract from the actual page, I get the following:
TypeError: object of type 'Response' has no len()
The page is at
https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23
And I have used the following:
import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
soup = BeautifulSoup(url, 'lxml')
for div in soup.findAll('div', attrs={'class': 'wisbb_teamA'}):
    print(div.find('a').contents[0])
Thank you.
Upvotes: 0
Views: 111
Reputation: 28620
You could just go through the API to get exactly the data you want. Below I also take the response and parse it into a table, but you don't necessarily need to do that.
import requests

url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'
payload = {
    'enable': 'odds,teamdetails',
    'date': '20190923',
    'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()
for each in jsonData['page']:
    homeTeam = each['homeTeam']['name']
    awayTeam = each['awayTeam']['name']
    print('Home Team: %s\nAway Team: %s\n' % (homeTeam, awayTeam))
Output:
Home Team: Nationals
Away Team: Phillies
Home Team: Blue Jays
Away Team: Orioles
Home Team: Rays
Away Team: Red Sox
Home Team: Mets
Away Team: Marlins
Home Team: Diamondbacks
Away Team: Cardinals
Parse to table:
import pandas as pd
import requests
import re

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '.')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '.')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'
payload = {
    'enable': 'odds,teamdetails',
    'date': '20190923',
    'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()
flat = flatten_json(jsonData)

results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\.(\d+)\.', item)[0]
    except:
        special_cols.append(item)
        continue
    column = re.findall(r'\.\d+\.(.*)', item)[0]
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
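Once the flattened frame is built, the matchup columns can be pulled straight out of it. A minimal sketch, assuming the flattening produces homeTeam.name and awayTeam.name columns (the exact names depend on the JSON shape):

# Column names here are assumptions based on the JSON fields used above
matchup_cols = ['homeTeam.name', 'awayTeam.name']
print(results[matchup_cols].to_string(index=False))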
Upvotes: 1
Reputation: 103
You need to pass url.text to BeautifulSoup, and try switching the parser from lxml to 'html.parser'.
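For example, only the soup construction from the question needs to change (a minimal sketch of that fix):

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
# Pass the html string (url.text), not the Response object itself
soup = BeautifulSoup(url.text, 'html.parser')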
Your approach is otherwise fine; the problem is that the complete content is never fetched from this page. Below is sample code that works.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome('../chromedriver.exe')
driver.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

delay = 10  # seconds
try:
    myElem = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'wisbb_teamA')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
soup.find("div", attrs={"class": "wisbb_teamA"})
Credits: This post
What I also found is that requests.get wasn't returning the complete DOM: the scores are rendered by JavaScript after the initial page load, and requests only fetches the raw HTML, so the wisbb_teamA elements never show up in its response.
But this will do the job for you.
Note that I had to use an explicit wait, because even with Selenium the complete HTML wasn't fetched right away, so I had to wait for the particular element to load.
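As a quick sanity check, you can confirm the class is missing from the plain requests response (a minimal sketch; expect False if the markup really is injected client-side):

import requests

resp = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
# Expected to print False: the scores markup is injected client-side,
# so the static html from requests should not contain it
print('wisbb_teamA' in resp.text)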
Update:
Use this to find the team names
for s in soup.findAll("div", attrs={"class": "wisbb_teamA"}):
    links = s.findAll("a")
    for link in links:
        print(link.text)
Upvotes: 1
Reputation: 134
The error that you are receiving,
TypeError: object of type 'Response' has no len()
comes from the fact that your url variable is a Response object, not the actual html. You can get the html from the .text attribute, like the following:
import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
print(url)       # Response object
print(url.text)  # html string
soup = BeautifulSoup(url.text, 'lxml')  # new soup code (it gets the html now)
Here's an example of extracting links that may send you in the right direction:
for link in soup.find_all('a'):
    print(link.get('href'))
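Applying the same .text fix to the loop from the question would look like this (a sketch; note that, as the Selenium answer points out, this page injects the wisbb_teamA divs with JavaScript, so the static html may not contain them at all):

soup = BeautifulSoup(url.text, 'lxml')
# May print nothing for this page, since the divs are rendered client-side
for div in soup.findAll('div', attrs={'class': 'wisbb_teamA'}):
    print(div.find('a').contents[0])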
Upvotes: 2