Reputation: 35
I'm trying to extract player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:
player_id = 'malcolm-brogdon-1'
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
pos = page_soup.p.find("strong").next_sibling.strip()
pos
However, I want to be able to do this in a more dynamic way (that is, to locate "Position:" and then find what comes after). There are other players for which the webpage is structured slightly differently, and my current code wouldn't return position (i.e. Cat Barber).
I've tried doing something like page_soup.find("strong", text="Position:")
but that doesn't seem to work.
Upvotes: 1
Views: 52
Reputation: 195613
You can select the element that contains the text "Position:" and then the next text sibling:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = soup.select_one('strong:contains("Position")').find_next_sibling(text=True).strip()
print(pos)
Prints:
Guard
EDIT: Another version:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = (
soup.find("strong", text=lambda t: "Position" in t)
.find_next_sibling(text=True)
.strip()
)
print(pos)
Upvotes: 1