Matt

Reputation: 81

How can I extract text from a class tag that appears after an <a href> tag within a <div> using BeautifulSoup and Python?

I am trying to extract the text from a class that appears within (and after) an <a href> tag, like so:

from bs4 import BeautifulSoup


html = """<div class="wisbb_teamA">
    <a href="http://www.example.com/eg1" class="wisbb_name">Phillies</a>
</div>"""

soup = BeautifulSoup(html,"lxml")

for div in soup.findAll('div', attrs={'class':'wisbb_teamA'}):
    print(div.find('a').contents[0])

This will return the following:

Phillies

This is correct, but when I try to extract from the actual page, I get the following:

TypeError: object of type 'Response' has no len()

The page is at

https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23

And I have used the following:

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

soup = BeautifulSoup(url,'lxml')

for div in soup.findAll('div', attrs={'class':'wisbb_teamA'}):
    print(div.find('a').contents[0])

Thank you.

Upvotes: 0

Views: 111

Answers (3)

chitown88

Reputation: 28620

You could just go through the API to get exactly the data you want. Here I took the response and parsed it into a table, but you don't necessarily need to do that.

import requests

url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'

payload = {
    'enable': 'odds,teamdetails',
    'date': '20190923',
    'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()

for each in jsonData['page']:
    homeTeam = each['homeTeam']['name']
    awayTeam = each['awayTeam']['name']
    print('Home Team: %s\nAway Team: %s\n' % (homeTeam, awayTeam))

Output:

Home Team: Nationals
Away Team: Phillies

Home Team: Blue Jays
Away Team: Orioles

Home Team: Rays
Away Team: Red Sox

Home Team: Mets
Away Team: Marlins

Home Team: Diamondbacks
Away Team: Cardinals  

Parse to table:

import pandas as pd
import requests
import re

def flatten_json(y):
    # Recursively flatten nested dicts/lists into a single flat dict
    # with dot-separated keys, e.g. 'page.0.homeTeam.name'.
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '.')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '.')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'

payload = {
    'enable': 'odds,teamdetails',
    'date': '20190923',
    'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()
flat = flatten_json(jsonData)


results = pd.DataFrame()
special_cols = []

columns_list = list(flat.keys())
for item in columns_list:
    try:
        # Keys like 'page.0.homeTeam.name' carry a row index
        row_idx = re.findall(r'\.(\d+)\.', item)[0]
    except IndexError:
        # Keys without a row index apply to every row
        special_cols.append(item)
        continue
    column = re.findall(r'\.\d+\.(.*)', item)[0]

    row_idx = int(row_idx)
    value = flat[item]

    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
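
Assuming the API response keeps the homeTeam.name / awayTeam.name structure shown above, a quick spot-check of the resulting table could look like this:

# Hypothetical spot-check: these column names come from the flattened
# 'page.N.homeTeam.name' keys; adjust them if the API layout differs.
print(results[['homeTeam.name', 'awayTeam.name']].head())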

Upvotes: 1

Piyush Kansal

Reputation: 103

You need to pass url.text to BeautifulSoup, and try switching the parser from 'lxml' to 'html.parser'.
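
A minimal sketch of that fix, using the same URL as in the question:

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
# Pass the response body (a string), not the Response object itself
soup = BeautifulSoup(url.text, 'html.parser')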

That fixes the TypeError, but your approach alone still won't work, because the complete content never gets fetched: the scores are rendered client-side by JavaScript, so they are not in the HTML that requests returns.

I have given below a sample code that works.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome('../chromedriver.exe')
driver.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

delay = 10  # seconds
try:
    # Wait until the team divs have actually been rendered
    myElem = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'wisbb_teamA')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
print(soup.find("div", attrs={"class": "wisbb_teamA"}))

Credits: This post

I also found that requests.get wasn't returning the complete DOM. There is no requests or urllib parameter to control this: those libraries only fetch the raw HTML, and anything the page builds afterwards with JavaScript never appears in it, as the quick check below shows.
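
A minimal sketch of that check, assuming the class name from the question:

import requests

html = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23").text
# If the divs were in the static HTML this count would be non-zero;
# for a JavaScript-rendered page it is typically 0.
print(html.count('wisbb_teamA'))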

But the Selenium approach above will do the job for you.

Note that I had to use an explicit wait, because even with Selenium the complete HTML wasn't fetched immediately; I had to wait for the particular element to finish loading.

Update:

Use this to find the team names:

for s in soup.findAll("div", attrs={"class":"wisbb_teamA"}):
    links = s.findAll("a")
    for link in links:
        print(link.text)

Upvotes: 1

JimithyPicker

Reputation: 134

The error that you are receiving,

TypeError: object of type 'Response' has no len()

is coming from the fact that your url variable is a Response object, not the actual HTML. You can get the HTML by using the .text attribute, like the following:

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

print(url)       # Response object
print(url.text)  # the HTML as a string

soup = BeautifulSoup(url.text, 'html.parser')  # new soup code (now it's parsing HTML)

Here's an example of extracting links that may send you in the right direction:

for link in soup.find_all('a'):
    print(link.get('href'))
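
And a sketch of your original script with the fix folded in (note that if the page builds these divs with JavaScript, as the Selenium answer suggests, the loop may still print nothing from the static HTML):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
soup = BeautifulSoup(response.text, 'lxml')  # parse the body, not the Response

for div in soup.findAll('div', attrs={'class': 'wisbb_teamA'}):
    print(div.find('a').contents[0])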

Upvotes: 2
