Reputation: 247
In Google Colab, I am trying to scrape the table from the following web page: https://247sports.com/college/penn-state/Sport/Football/AllTimeRecruits/
Below is the Python script I am using:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Team = 'penn-state'
url = "https://247sports.com/college/" + str(Team) + "/Sport/Football/AllTimeRecruits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers = headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item"):
    rank = tag.find_next("span", class_="all-time-rank").text
    school = tag.find_next("span", class_="meta").text
    year = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    # status = tag.find_next("p", class_="commit-date withDate").text
    data.append(
        {
            "Rank": rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
            # "Date": status,
        }
    )
df = pd.DataFrame(data)
df
I want a column holding each player's recruiting class year. For example, if a player is from "class of 2005", I want "2005" as the value in the "year" column. This is the output I currently get:
     Rank  Name              School                                       Class of                                     Position  Height & Weight  Rating  National Rank  State Rank  Position Rank
0    1     Derrick Williams  Eleanor Roosevelt (Greenbelt, MD)            Eleanor Roosevelt (Greenbelt, MD)            WR        6-0 / 190        0.9986  4              1           2
1    2     Micah Parsons     Harrisburg (Harrisburg, PA)                  Harrisburg (Harrisburg, PA)                  WDE       6-3 / 235        0.9982  5              1           2
2    3     Justin Shorter    South Brunswick (Monmouth Junction, NJ) ...  South Brunswick (Monmouth Junction, NJ) ...  WR        6-4 / 213        0.9962  8              1           1
3    4     Dan Connor        Strath Haven (Wallingford, PA)               Strath Haven (Wallingford, PA)               ILB       6-3 / 215        0.9944  13             1           2
4    5     Justin King       Gateway (Monroeville, PA)                    Gateway (Monroeville, PA)                    CB        6-0 / 185        0.9942  15             1           2
...  ...   ...               ...                                          ...                                          ...       ...              ...     ...            ...         ...
242  243   Will Levis        Xavier (Middletown, CT)                      Xavier (Middletown, CT)                      PRO       6-4 / 222        0.8689  652            2           28
243  244   Troy Reeder       Salesianum (Wilmington, DE)                  Salesianum (Wilmington, DE)                  ILB       6-2 / 230        0.8687  500            2           22
244  245   Jake Cooper       Archbishop Wood (Warminster, PA)             Archbishop Wood (Warminster, PA)             ILB       6-1 / 220        0.8686  520            11          17
245  246   Jon Ditto         Gateway (Monroeville, PA)                    Gateway (Monroeville, PA)                    WR        6-3 / 221        0.8684  417            16          52
246  247   Shareef Miller    George Washington (Philadelphia, PA)         George Washington (Philadelphia, PA)         SDE       6-5 / 230        0.8681  525            12          27
247 rows × 10 columns
However, the "Class of" column just duplicates the school. Looking at the HTML, both the high school and the year sit in "span" elements with the same class ("meta"), so both find_next calls return the first one. Is there a way to scrape the high school and the year separately, given how the HTML is set up?
Any assistance on how to make this work would be truly appreciated.
Upvotes: 0
Views: 269
Reputation: 9941
You have two spans with class "meta", the first for the school and the second for the year (always in this order), so you can use find_all to get both, then extract school from the first one and year from the second one:
for tag in soup.find_all("li", class_="ri-page__list-item"):
    meta = tag.find_all("span", class_="meta")
    school = meta[0].text
    year = meta[1].text.replace('Class of ', '')
    # extract other fields...
    # data.append(...)
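The replace call above assumes the span text is exactly "Class of 2005". If the scraped text turns out to carry surrounding whitespace or a slightly different label, pulling out the four-digit number with a regular expression may be more forgiving. A minimal sketch using only the standard library; parse_class_year is a hypothetical helper name, and the sample strings are made up for illustration rather than taken from the page:

```python
import re


def parse_class_year(meta_text):
    """Return the four-digit year found in text like 'Class of 2005', or None."""
    match = re.search(r"\b(\d{4})\b", meta_text)
    return int(match.group(1)) if match else None


# Illustrative inputs only (assumed shapes of the span text, not scraped output).
samples = ["Class of 2005", "\n  Class of 2018  \n", "Harrisburg (Harrisburg, PA)"]
years = [parse_class_year(s) for s in samples]
print(years)  # [2005, 2018, None]
```

In the loop, this would be used as year = parse_class_year(meta[1].text), which also gives an integer column in the DataFrame instead of strings.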
Upvotes: 1