Nate Walker

Reputation: 217

Extracting a Specific Part of a Link Using Beautiful Soup

Below is a section of my web scraper that scrapes a team roster from this website, puts the player information into arrays, and exports the arrays as columns in a CSV file. My scraper works fine, but I would also like to pull the player's ID number, which is embedded in the href attribute of the player's link.

<a href="/player/542882/matt-andriese">Matt Andriese</a>

As you can see from my code, I am already searching for ('a') to extract the player name (Matt Andriese), but I also want to extract the player ID number embedded within the link (542882). Does anyone know how to solve this problem? Thanks in advance!

import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')

# Drop the navigation tabs and the right-hand sidebar so their links and
# cells don't end up in the roster lookups below.
soup.find(class_='nav-tabset-container').decompose()
soup.find(class_='column secondary span-5 right').decompose()

# Pull each column of the roster table into its own list.
roster = soup.find(class_='layout layout-roster')
names = [n.contents[0] for n in roster.find_all('a')]
number = [n.contents[0] for n in roster.find_all('td', index='0')]
handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
height = [n.contents[0] for n in roster.find_all('td', index='4')]
weight = [n.contents[0] for n in roster.find_all('td', index='5')]
DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
team = [soup.find('meta', property='og:site_name')['content']] * len(names)

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team'])
    f.writerows(zip(names, number, handedness, height, weight, DOB, team))

Upvotes: 0

Views: 75

Answers (2)

Sufian Latif

Reputation: 13356

If link is the BeautifulSoup object for the anchor tag, you can read the href value as link['href']. To be safe, you may want to confirm the tag actually has an href attribute first, for example with link.has_attr('href'). Once you have the URL, split it on the '/' characters.

In your case, you could do something like this:

ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
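
For example, building on the script in your question, a minimal sketch (assuming every player link keeps the /player/<id>/<name> shape and that anchors without an href can simply be skipped) might look like this:

# A minimal, self-contained sketch based on the question's setup; it assumes
# the roster page still serves /player/<id>/<name> links like the one above.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')
roster = soup.find(class_='layout layout-roster')

players = []
for link in roster.find_all('a'):
    if not link.has_attr('href'):          # skip anchors that carry no href at all
        continue
    parts = link['href'].split('/')        # ['', 'player', '542882', 'matt-andriese']
    if len(parts) > 2 and parts[2].isdigit():
        players.append((link.contents[0], parts[2]))

names = [name for name, _ in players]
ids = [player_id for _, player_id in players]   # e.g. '542882' for Matt Andriese

Because names and ids are built from the same filtered links, the two lists stay aligned, so ids can go straight into your csv.writerows call as another column.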

Upvotes: 1

Ajax1234

Reputation: 71451

You can use re:

import requests, re
from bs4 import BeautifulSoup as soup

d = soup(requests.get('http://m.mlb.com/tb/roster').text, 'html.parser')
# (tag, class, optional extractor): the extractor pulls the img src or the
# numeric player id out of the href; cells without one just use their text.
headers = [['td', 'dg-jersey_number'],
           ['td', 'dg-player_headshot', lambda x: x.find('img')['src']],
           ['td', 'dg-name_display_first_last', lambda x: re.findall(r'\d+', x.find('a')['href'])[0]],
           ['td', 'dg-bats_throws'],
           ['td', 'dg-height'],
           ['td', 'dg-weight'],
           ['td', 'dg-date_of_birth']]

def get_data(row):
    # For each header, apply either the plain-text getter or the custom extractor.
    return [[lambda x: x.text, None if not c else c[0]][bool(c)](row.find(a, {'class': b})) for a, b, *c in headers]

final_results = [get_data(i) for i in d.find_all('tr', {'index': re.compile(r'\d+')})]

Output:

[['46', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621237@2x.jpg', '621237', 'L/L', '6\'2"', '245lbs', '5/21/95'],
 ['35', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/542882@2x.jpg', '542882', 'R/R', '6\'2"', '225lbs', '8/28/89'],
 ['22', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/502042@2x.jpg', '502042', 'R/R', '6\'2"', '195lbs', '9/26/88'],
 ['63', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/650895@2x.jpg', '650895', 'R/R', '6\'3"', '240lbs', '1/18/94'],
 ['24', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/543135@2x.jpg', '543135', 'R/R', '6\'2"', '225lbs', '2/13/90'],
 ['58', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/629496@2x.jpg', '629496', 'R/R', '6\'0"', '220lbs', '11/4/93'],
 ['36', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/552640@2x.jpg', '552640', 'R/R', '6\'1"', '200lbs', '3/17/90'],
 ['56', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/592473@2x.jpg', '592473', 'L/L', '6\'3"', '205lbs', '1/14/89'],
 ['54', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/489265@2x.jpg', '489265', 'R/R', '5\'11"', '185lbs', '3/4/83'],
 ['57', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621289@2x.jpg', '621289', 'R/R', '5\'10"', '200lbs', '6/20/91'],
 ['4', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/605483@2x.jpg', '605483', 'L/L', '6\'4"', '200lbs', '12/4/92'],
 ['55', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/592773@2x.jpg', '592773', 'R/R', '6\'4"', '215lbs', '7/26/91'],
 ['61', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621056@2x.jpg', '621056', 'R/R', '6\'1"', '165lbs', '8/12/93'],
 ['48', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/642232@2x.jpg', '642232', 'R/L', '6\'5"', '205lbs', '12/31/91'],
 ['40', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/467092@2x.jpg', '467092', 'R/R', '6\'1"', '245lbs', '8/10/87'],
 ['45', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/491696@2x.jpg', '491696', 'R/R', '6\'0"', '200lbs', '4/30/88'],
 ['9', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/641343@2x.jpg', '641343', 'L/L', '6\'1"', '195lbs', '10/6/95'],
 ['26', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/596847@2x.jpg', '596847', 'L/R', '6\'1"', '230lbs', '5/19/91'],
 ['44', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/543068@2x.jpg', '543068', 'R/R', '6\'4"', '235lbs', '1/5/90'],
 ['5', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/622110@2x.jpg', '622110', 'R/R', '6\'2"', '170lbs', '1/15/91'],
 ['11', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/588751@2x.jpg', '588751', 'R/R', '6\'0"', '195lbs', '4/15/89'],
 ['28', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621002@2x.jpg', '621002', 'R/R', '5\'11"', '200lbs', '3/22/94'],
 ['18', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621563@2x.jpg', '621563', 'L/R', '6\'1"', '190lbs', '4/26/90'],
 ['27', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/460576@2x.jpg', '460576', 'R/R', '6\'3"', '220lbs', '12/4/85'],
 ['39', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/595281@2x.jpg', '595281', 'L/R', '6\'1"', '215lbs', '4/22/90'],
 ['0', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/605480@2x.jpg', '605480', 'L/R', '5\'10"', '180lbs', '5/6/93']]

Note that the output contains the player ID as the third element in each sublist.
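
If you prefer to stay closer to the code in your question, a small sketch of the same re idea (assuming every player href matches the /player/<id>/<name> pattern shown there) can pull the IDs out of the anchors you already collect:

# A sketch applying re to the question's original roster links; it assumes
# every player href matches the /player/<id>/<name> pattern from the example.
import re
import requests
from bs4 import BeautifulSoup

page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')
roster = soup.find(class_='layout layout-roster')

ids = [re.search(r'/player/(\d+)/', a['href']).group(1)
       for a in roster.find_all('a', href=re.compile(r'/player/\d+/'))]
print(ids)  # e.g. ['621237', '542882', ...]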

Upvotes: 1
