bdure

Reputation: 101

Getting first (or a specific) td in BeautifulSoup with no class

I have one of those nightmare tables with no class given for the tr and td tags.

A sample page is here: https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m

(You'll see in the code below that I'm getting multiple pages, but that's not the problem.)

I want the team name (nothing else) from each bracket. The output should be:

OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
etc.

I've been able to get every td in the specified tables. But every attempt to use [0] to get the first td of every row gives me an "index out of range" error.

The code is:

import requests
import csv 
from bs4 import BeautifulSoup

batch_size = 2
urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m', 'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']

# iterate through urls
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # iterate through leagues and teams
    leagues = soup.find_all('table', class_='table table-bordered table-hover table-condensed')
    for league in leagues:
        row = ''
        rows = league.find_all('tr')
        for row in rows:
            team = row.find_all('td')
            teamName = team[0].text.strip()    
            print(teamName)

After a couple of hours of work, I feel like I'm just one syntax change away from getting this right. Yes?

Upvotes: 1

Views: 1215

Answers (3)

MendelG

Reputation: 20018

You can use the CSS selector nth-of-type(n) to pick the first td of each row. It works for both URLs:

import requests
from bs4 import BeautifulSoup

url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
    print(tag.text.strip())

Output:

OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
...
...
Real Salt Lake U19
Real Colorado
Empire United Soccer Academy
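
To see why this selector skips the header row without any extra check, here is a minimal offline sketch. The HTML string is a hypothetical miniature of one bracket table, not the real page markup; only the class name small-margin-bottom is taken from the selector above, and where that class sits on the real page is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one bracket table (assumed structure, not the real page).
SAMPLE_BRACKET_HTML = """
<div class="small-margin-bottom">
  <table>
    <tr><th>Team</th><th>MP</th></tr>
    <tr><td>OCYS</td><td>3</td></tr>
    <tr><td>FL Rush</td><td>3</td></tr>
  </table>
</div>
"""

bracket_soup = BeautifulSoup(SAMPLE_BRACKET_HTML, "html.parser")

# td:nth-of-type(1) matches a td that is the first td among its siblings.
# The header row holds only th cells, so nothing in it matches.
selected_names = [
    td.get_text(strip=True)
    for td in bracket_soup.select(".small-margin-bottom td:nth-of-type(1)")
]
print(selected_names)
```

Because the match is on td specifically, rows made only of th cells contribute nothing, so there is no empty list to index into.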

Upvotes: 2

Paul M.

Reputation: 10799

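Each bracket corresponds to one "panel". Each panel has two rows, and the first row contains the table listing all the teams that appear in the match tables. Taking the first tbody in each panel and the first td of each of its rows gives the team names: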

import sys

import requests
from bs4 import BeautifulSoup


def main():
    url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"

    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error

    soup = BeautifulSoup(response.content, "html.parser")

    # one panel per bracket; the first tbody in it is the team table
    for panel in soup.find_all("div", {"class": "panel-body"}):
        for row in panel.find("tbody").find_all("tr"):
            # the first td of each row is the team name
            print(row.find("td").text.strip())

    return 0


if __name__ == "__main__":
    sys.exit(main())

Output:

OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
Weston FC
Chargers SC
South Florida FA
Solar SC
RISE SC
...

Upvotes: 1

Alexandra Dudkina

Reputation: 4462

I think the problem is the table's header row, which contains th elements instead of td elements. For that row, find_all('td') returns an empty list, and taking its first element raises the "index out of range" error. Add a check that the list is not empty:

for row in rows:
    team = row.find_all('td')
    # header rows have no td cells, so skip them
    if team:
        teamName = team[0].text.strip()
        print(teamName)

This should print the team names.
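
As a quick way to confirm the diagnosis without hitting the site, here is a small sketch that runs against an inline HTML string. The markup and the helper name first_cells are hypothetical, made up for illustration; the real page's structure is assumed to have a th-only header row, as described above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: a th-only header row, then td data rows.
SAMPLE_TABLE_HTML = """
<table>
  <tr><th>Team</th><th>MP</th></tr>
  <tr><td>OCYS</td><td>3</td></tr>
  <tr><td>FL Rush</td><td>3</td></tr>
</table>
"""


def first_cells(html):
    """Return the text of the first td in every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:  # header row: only th cells, so the list is empty
            continue
        names.append(cells[0].get_text(strip=True))
    return names


# The header row is the first tr; asking it for td cells yields an empty list,
# which is what makes team[0] fail in the original loop.
header_row = BeautifulSoup(SAMPLE_TABLE_HTML, "html.parser").find("tr")
header_cells = header_row.find_all("td")

team_names = first_cells(SAMPLE_TABLE_HTML)
print(len(header_cells), team_names)
```

If the header row on the real page turns out to contain td cells after all, this check would not be the cause, so it is worth printing len(team) for each row once to verify.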

Upvotes: 0
