kendall weihe

Reputation: 73

Python script to extract data from an HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.

I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that team's ranks from all links.

Luckily, I found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8

...which is GREAT!

But I must have something wrong, because I'm getting 0.

Here's my code:

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for table_row in soup.select(".expand-section li"):

    table_cells = table_row.findAll('li')

    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)

    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank

Check out that link to double check I have the correct class specified. I have a feeling the problem might be in the first for loop where I select an li tag then select all li tags within that first tag, I dunno.

I don't use Python so I'm unfamiliar with any debugging tools. So if anyone wants to forward me to one of those that would be great!

Upvotes: 1

Views: 353

Answers (2)

floydn

Reputation: 1131

First, the team stats and player stats sections are each contained in a <div class="column large-2">. The team stats are in the first occurrence, so you can take that div and find every tag inside it that has an href attribute. I've combined both steps in a one-liner.

teamstats = soup(class_='column large-2')[0].find_all(href=True)

The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.

links = [a['href'] for a in teamstats if a['href'] != '#']

Here is a sample of output:

links
Out[84]: 
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
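Note that these hrefs are relative, so passing them straight to requests.get() would fail. A minimal sketch of turning them into absolute URLs with the standard library's urljoin, assuming the base is the stats page URL from the question (BASE_URL and absolute_links are just illustrative names, and the list below is a shortened copy of the sample output):

```python
from urllib.parse import urljoin

# Assumed base: the page the links were scraped from (taken from the question).
BASE_URL = 'https://www.teamrankings.com/ncb/stats/'

# Two of the relative hrefs from the sample output above.
links = [
    '/ncaa-basketball/stat/points-per-game',
    '/ncaa-basketball/stat/average-scoring-margin',
]

# urljoin resolves each root-relative path against the site's scheme and host.
absolute_links = [urljoin(BASE_URL, href) for href in links]

for url in absolute_links:
    print(url)
```

Each entry in absolute_links can then be fetched with requests.get(url).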

Upvotes: 1

Michael Sova

Reputation: 1

I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list. That means stat_links ends up empty, so the loop over stat_links never runs and total_rank is never incremented. I suggest you fiddle around with the way you find the list elements.
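To illustrate, here is a rough sketch run against small made-up HTML snippets rather than the live site, so the markup, the team names and the rank value are all assumptions for demonstration only. It shows the nested li lookup coming back empty, one way of selecting the anchors directly, and a second-loop variant that converts the rank text to an int before adding it (adding a string to an int raises a TypeError in Python). It also uses table.tr-table as the selector, since spaces in a CSS selector mean "descendant", not "also has this class".

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the stats index page.
index_html = """
<ul class="expand-section">
  <li><a href="/ncaa-basketball/stat/points-per-game">Points per Game</a></li>
  <li><a href="#">Toggle</a></li>
  <li><a href="/ncaa-basketball/stat/floor-percentage">Floor %</a></li>
</ul>
"""
index_soup = BeautifulSoup(index_html, "html.parser")

# The original pattern: look for <li> tags nested inside each <li>.
nested = [li.find_all("li") for li in index_soup.select(".expand-section li")]
print(all(len(found) == 0 for found in nested))  # nothing is nested in this flat list

# Selecting the anchors directly avoids the nested lookup.
stat_links = [
    a["href"]
    for a in index_soup.select(".expand-section li a[href]")
    if a["href"] != "#"
]

# Hypothetical markup standing in for a single stat page.
stat_html = """
<table class="tr-table datatable scrollable">
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>1</td><td>Kansas</td></tr>
  <tr><td>7</td><td>Oklahoma</td></tr>
</table>
"""
stat_soup = BeautifulSoup(stat_html, "html.parser")

total_rank = 0
for row in stat_soup.select("table.tr-table tr"):
    cells = row.find_all("td")
    # Header rows have no <td> cells, so skip anything too short.
    if len(cells) > 1 and cells[1].text.strip() == "Oklahoma":
        total_rank += int(cells[0].text.strip())

print(total_rank)
```

The real page may be structured differently, so check the selectors against the actual HTML before relying on them. Printing len(stat_links) right after the first loop is a quick way to see whether anything was collected.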

Upvotes: 0
