Reputation: 394
I'm having some trouble extracting player IDs from a site's HTML. I've done this before without an issue, but the hrefs in this particular HTML are a bit different and have me stumped. Below are a portion of the HTML and the script I've put together, which gives {} for each row when printed. The ID below is 'lynnla01' and appears in the HTML twice, so extracting either occurrence would be fine. Any help would be greatly appreciated.
HTML:
<tr data-row="248">
<th scope="row" class="right " data-stat="ranker" csk="240">1</th>
<td class="left " data-append-csv="lynnla01" data-stat="player">
<a href="/players/l/lynnla01.shtml">Lance Lynn</a>
One of my attempts:
ID = []
for tag in soup.select('a[href^=/players]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)
print(ID)
Upvotes: 0
Views: 417
Reputation: 2688
Using built-in string methods and BeautifulSoup:
from bs4 import BeautifulSoup as bs
html = '''<tr data-row="248">
<th scope="row" class="right " data-stat="ranker" csk="240">1</th>
<td class="left " data-append-csv="lynnla01" data-stat="player">
<a href="/players/l/lynnla01.shtml">Lance Lynn</a>'''
soup = bs(html, 'lxml')
hrefs = soup.find_all('a')
for a_tag in hrefs:
    if a_tag['href'].startswith('/players'):
        print(a_tag['href'])
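The loop above prints the whole href. If only the ID is wanted and you'd rather stay with built-ins, one option is to take the last path segment and drop the extension. This is a minimal sketch, assuming every player link looks like the one in the sample (/players/<letter>/<id>.shtml); the helper name player_id_from_href is made up for illustration, and the hrefs list stands in for the values collected from the soup.

```python
from posixpath import basename, splitext


def player_id_from_href(href):
    """Return the file name of a URL path without its extension.

    Hypothetical helper: assumes the ID is the last path segment,
    e.g. '/players/l/lynnla01.shtml' -> 'lynnla01'.
    """
    return splitext(basename(href))[0]


# Stand-in for the href values gathered with soup.find_all('a')
hrefs = ['/players/l/lynnla01.shtml', '/teams/CHW/2021.shtml']

# Keep only player links, then reduce each one to its ID
ids = [player_id_from_href(h) for h in hrefs if h.startswith('/players')]
print(ids)
```

This also hints at why parse_qs in the question gives {}: it parses query strings of the form key=value&key=value, and this href is a plain path with no key=value pairs in it.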
With regular expressions:
import re
regex = re.compile(r'/players.+')
a_tags = soup.find_all('a', href=regex)
# print(a_tags)  # or loop over a_tags and print each i['href']
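To print only the ID you asked for, use a pattern with a capture group (calling group(1) on a pattern without one raises an IndexError):
id_regex = re.compile(r'/players/\w/(\w+)\.shtml')
for i in a_tags:
    only_specific = id_regex.match(i['href'])
    if only_specific:
        print(only_specific.group(1))  # lynnla01 for the sample HTML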
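Since the question mentions that the ID appears twice, the other copy (the data-append-csv attribute on the td) can be read directly with no string parsing at all. A rough sketch, assuming that attribute is present on every player cell as it is in the sample row; the sample is wrapped in a table and closed here so that it parses on its own, and 'html.parser' is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Sample row from the question, closed and wrapped in <table> for a standalone parse
html = '''<table><tr data-row="248">
<th scope="row" class="right " data-stat="ranker" csk="240">1</th>
<td class="left " data-append-csv="lynnla01" data-stat="player">
<a href="/players/l/lynnla01.shtml">Lance Lynn</a></td></tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# attrs={...: True} matches any <td> that carries the attribute, whatever its value
player_cells = soup.find_all('td', attrs={'data-append-csv': True})
player_ids = [td['data-append-csv'] for td in player_cells]
print(player_ids)
```

Whether this or the href approach is the better fit depends on the real page: if some rows link to a player without carrying the attribute, the href route is likely the safer one.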
Upvotes: 2