BLuta

Reputation: 247

Scrape a school's top 247 college football recruits of all time

In Google Colab, I am trying to scrape the table from the following web page: https://247sports.com/college/penn-state/Sport/Football/AllTimeRecruits/

Below is the Python script I am using:

import requests
import pandas as pd
from bs4 import BeautifulSoup

Team = 'penn-state'

url = "https://247sports.com/college/" + str(Team) + "/Sport/Football/AllTimeRecruits/"

# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

response = requests.get(url, headers = headers).content
soup = BeautifulSoup(response, "html.parser")
data = []

for tag in soup.find_all("li", class_="ri-page__list-item"):
    rank = tag.find_next("span", class_="all-time-rank").text
    school = tag.find_next("span", class_="meta").text
    year = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
#    status = tag.find_next("p", class_="commit-date withDate").text

    data.append(
        {
            "Rank": rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
#            "Date": status,
        }
    )

df = pd.DataFrame(data)

df

I want a column with each player's recruiting class year. For example, if a player is from "Class of 2005", I want "2005" as the value in the "Class of" column. This is the output I currently get:

Index | Rank | Name | School | Class of | Position | Height & Weight | Rating | National Rank | State Rank | Position Rank
0 | 1 | Derrick Williams | Eleanor Roosevelt (Greenbelt, MD) | Eleanor Roosevelt (Greenbelt, MD) | WR | 6-0 / 190 | 0.9986 | 4 | 1 | 2
1 | 2 | Micah Parsons | Harrisburg (Harrisburg, PA) | Harrisburg (Harrisburg, PA) | WDE | 6-3 / 235 | 0.9982 | 5 | 1 | 2
2 | 3 | Justin Shorter | South Brunswick (Monmouth Junction, NJ) ... | South Brunswick (Monmouth Junction, NJ) ... | WR | 6-4 / 213 | 0.9962 | 8 | 1 | 1
3 | 4 | Dan Connor | Strath Haven (Wallingford, PA) | Strath Haven (Wallingford, PA) | ILB | 6-3 / 215 | 0.9944 | 13 | 1 | 2
4 | 5 | Justin King | Gateway (Monroeville, PA) | Gateway (Monroeville, PA) | CB | 6-0 / 185 | 0.9942 | 15 | 1 | 2
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
242 | 243 | Will Levis | Xavier (Middletown, CT) | Xavier (Middletown, CT) | PRO | 6-4 / 222 | 0.8689 | 652 | 2 | 28
243 | 244 | Troy Reeder | Salesianum (Wilmington, DE) | Salesianum (Wilmington, DE) | ILB | 6-2 / 230 | 0.8687 | 500 | 2 | 22
244 | 245 | Jake Cooper | Archbishop Wood (Warminster, PA) | Archbishop Wood (Warminster, PA) | ILB | 6-1 / 220 | 0.8686 | 520 | 11 | 17
245 | 246 | Jon Ditto | Gateway (Monroeville, PA) | Gateway (Monroeville, PA) | WR | 6-3 / 221 | 0.8684 | 417 | 16 | 52
246 | 247 | Shareef Miller | George Washington (Philadelphia, PA) | George Washington (Philadelphia, PA) | SDE | 6-5 / 230 | 0.8681 | 525 | 12 | 27
247 rows × 10 columns

However, the "Class of" column just duplicates the school. In the HTML, both the high school and the year sit in "span" elements with the same class, so both lookups return the first one. Is there a way to scrape the high school and the year separately, given how the HTML is set up?

Any assistance on how to make this work would be truly appreciated.

Upvotes: 0

Views: 269

Answers (1)

perl

Reputation: 9941

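Each list item contains two spans with class meta: the first holds the school and the second holds the year (always in this order). Your code calls find_next with the same arguments twice, so both calls return the first span. Use find_all on the list item to get both spans, then take the school from the first and the year from the second: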

for tag in soup.find_all("li", class_="ri-page__list-item"):
    meta = tag.find_all("span", class_="meta")
    school = meta[0].text
    year = meta[1].text.replace('Class of ', '')

    # extract other fields...
    # data.append(...)
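
To try the idea without hitting the site, here is a minimal sketch that runs against an inline HTML snippet. Only the class names come from the question; the nesting, the header-like item, the placeholder players, and the helper name parse_recruits are assumptions made for illustration. It also converts the year to an integer and skips any item that lacks both spans.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the question.
# The real page's nesting and its header row may differ.
SAMPLE_HTML = """
<ul>
  <li class="ri-page__list-item">
    <span>Header-like item without meta spans</span>
  </li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link">Player A</a>
    <span class="meta">Example High (Town, ST)</span>
    <span class="meta">Class of 2005</span>
  </li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link">Player B</a>
    <span class="meta">Sample Prep (City, ST)</span>
    <span class="meta">Class of 2018</span>
  </li>
</ul>
"""


def parse_recruits(html):
    """Return one dict per list item that has both a school and a year span."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="ri-page__list-item"):
        # find_all on the item searches only its descendants, so the two
        # spans always belong to the same recruit.
        meta = item.find_all("span", class_="meta")
        if len(meta) < 2:
            continue  # skip items that do not carry both spans

        school = meta[0].get_text(strip=True)
        # Pull the four-digit year out of text such as "Class of 2005".
        match = re.search(r"\d{4}", meta[1].get_text())
        year = int(match.group()) if match else None

        name_tag = item.find("a", class_="ri-page__name-link")
        name = name_tag.get_text(strip=True) if name_tag else None

        rows.append({"Name": name, "School": school, "Class of": year})
    return rows


recruits = parse_recruits(SAMPLE_HTML)
print(recruits)
```

The same loop body should drop into your script in place of the two find_next calls for school and year, assuming the live page keeps both spans inside each list item.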

Upvotes: 1
