Reputation: 217
Below is a section of my web scraper that scrapes a team roster from this website, puts the player information into arrays, and exports the arrays to columns in a CSV file. My scraper works fine, but I would also like to pull each player's ID number, which is embedded in the href attribute of the player's link.
<a href="/player/542882/matt-andriese">Matt Andriese</a>
As you can see from my code, I am already searching for 'a' tags to extract the player name (Matt Andriese), but I also want to extract the player ID number embedded in the link (542882). Does anyone know how to solve this problem? Thanks in advance!
import requests
import csv
from bs4 import BeautifulSoup
page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')
soup.find(class_='nav-tabset-container').decompose()
soup.find(class_='column secondary span-5 right').decompose()
roster = soup.find(class_='layout layout-roster')
names = [n.contents[0] for n in roster.find_all('a')]
number = [n.contents[0] for n in roster.find_all('td', index='0')]
handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
height = [n.contents[0] for n in roster.find_all('td', index='4')]
weight = [n.contents[0] for n in roster.find_all('td', index='5')]
DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
team = [soup.find('meta',property='og:site_name')['content']] * len(names)
with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])
    f.writerows(zip(names, number, handedness, height, weight, DOB, team))
Upvotes: 0
Views: 75
Reputation: 13356
If link is the object corresponding to the <a> tag, then you can get the href value as link['href']. Just to be safe, make sure the tag actually has an href attribute by checking link.has_attr('href') (note that 'href' in link tests the tag's children, not its attributes). After you get the URL, split it on the / characters.
In your case, you could do something like this:
ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
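A slightly more defensive version of the same idea is sketched below. The helper name player_id_from_href is made up for illustration, and it assumes the hrefs keep the /player/<id>/<slug> shape shown in the question. It returns None instead of raising when a link has no href or a different shape.

```python
from typing import Optional


def player_id_from_href(href: Optional[str]) -> Optional[str]:
    """Return the ID segment of a '/player/<id>/<slug>' href, or None."""
    if not href:
        return None
    # '/player/542882/matt-andriese' -> ['player', '542882', 'matt-andriese']
    parts = href.strip('/').split('/')
    # Anything that is not 'player' followed by digits is not a player link.
    if len(parts) >= 2 and parts[0] == 'player' and parts[1].isdigit():
        return parts[1]
    return None


print(player_id_from_href('/player/542882/matt-andriese'))
```

With the roster object from the question, this could be used as ids = [player_id_from_href(n.get('href')) for n in roster.find_all('a')]. Because it walks the same list of tags as names, the two lists stay the same length and still line up in the zip that feeds the CSV.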
Upvotes: 1
Reputation: 71451
You can use re:
import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('http://m.mlb.com/tb/roster').text, 'html.parser')
headers = [['td', 'dg-jersey_number'], ['td', 'dg-player_headshot', lambda x:x.find('img')['src']], ['td', 'dg-name_display_first_last', lambda x:re.findall(r'\d+', x.find('a')['href'])[0]], ['td', 'dg-bats_throws'], ['td', 'dg-height'], ['td', 'dg-weight'], ['td', 'dg-date_of_birth']]
def get_data(d):
    # Use the column's custom extractor if one is given, otherwise take the cell text
    return [(c[0] if c else lambda x: x.text)(d.find(a, {'class':b})) for a, b, *c in headers]
final_results = [get_data(i) for i in d.find_all('tr', {'index':re.compile(r'\d+')})]
Output:
[['46', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '621237', 'L/L', '6\'2"', '245lbs', '5/21/95'], ['35', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '542882', 'R/R', '6\'2"', '225lbs', '8/28/89'], ['22', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '502042', 'R/R', '6\'2"', '195lbs', '9/26/88'], ['63', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '650895', 'R/R', '6\'3"', '240lbs', '1/18/94'], ['24', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '543135', 'R/R', '6\'2"', '225lbs', '2/13/90'], ['58', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '629496', 'R/R', '6\'0"', '220lbs', '11/4/93'], ['36', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '552640', 'R/R', '6\'1"', '200lbs', '3/17/90'], ['56', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '592473', 'L/L', '6\'3"', '205lbs', '1/14/89'], ['54', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '489265', 'R/R', '5\'11"', '185lbs', '3/4/83'], ['57', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '621289', 'R/R', '5\'10"', '200lbs', '6/20/91'], ['4', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '605483', 'L/L', '6\'4"', '200lbs', '12/4/92'], ['55', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '592773', 'R/R', '6\'4"', '215lbs', '7/26/91'], ['61', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '621056', 'R/R', '6\'1"', '165lbs', '8/12/93'], ['48', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '642232', 'R/L', '6\'5"', '205lbs', '12/31/91'], ['40', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '467092', 'R/R', '6\'1"', '245lbs', '8/10/87'], ['45', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '491696', 'R/R', '6\'0"', '200lbs', '4/30/88'], ['9', 
'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '641343', 'L/L', '6\'1"', '195lbs', '10/6/95'], ['26', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '596847', 'L/R', '6\'1"', '230lbs', '5/19/91'], ['44', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '543068', 'R/R', '6\'4"', '235lbs', '1/5/90'], ['5', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '622110', 'R/R', '6\'2"', '170lbs', '1/15/91'], ['11', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '588751', 'R/R', '6\'0"', '195lbs', '4/15/89'], ['28', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '621002', 'R/R', '5\'11"', '200lbs', '3/22/94'], ['18', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '621563', 'L/R', '6\'1"', '190lbs', '4/26/90'], ['27', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '460576', 'R/R', '6\'3"', '220lbs', '12/4/85'], ['39', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '595281', 'L/R', '6\'1"', '215lbs', '4/22/90'], ['0', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/[email protected]', '605480', 'L/R', '5\'10"', '180lbs', '5/6/93']]
Note that the output contains the player id as the third element in each sublist.
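One caveat: matching \d+ with re.findall and taking the first hit returns the first run of digits anywhere in the href. A narrower pattern is sketched below. It assumes the /player/<id>/<slug> URL shape from the question, and the names PLAYER_ID_RE and extract_player_id are illustrative only.

```python
import re
from typing import Optional

# Assumed URL shape: /player/<digits>/<slug> (the slug is optional here)
PLAYER_ID_RE = re.compile(r'/player/(\d+)(?:/|$)')


def extract_player_id(href: str) -> Optional[str]:
    """Return the digits following '/player/' in href, or None if absent."""
    match = PLAYER_ID_RE.search(href)
    return match.group(1) if match else None


print(extract_player_id('/player/542882/matt-andriese'))
```

Returning None for a non-matching link avoids the IndexError that indexing an empty findall result with [0] would raise.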
Upvotes: 1