Reputation: 11
I am trying to extract only the numeric part (25709 in the example below) and store it in a variable, let's call it athleteID, that I can later insert into a dynamic URL and iterate over to send search requests:
'<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>'
I have these URLs (or partial URLs) stored in a column of a dataframe. I have iterated over it twice using the split() function, first on '=', and managed to get to the point below.
id_list = []
for url in df2['athleteURL']:
    # Split each anchor tag on '=' so the ID starts the third piece
    parts = url.split('=')
    id_list.append(parts)
print(id_list)
This produces a list of lists; one entry is shown below as an example:
'<a href', '"../athletehistory/?athleteNumber', '25709" target', '"_top">Zola Budd</a>'
I then did a second iteration, this time splitting on '"', and got to the point below:
id_list2 = []
for id2 in id_list[2]:
    j = id2.split('"')
    id_list2.append(j)
#print(id_list2[2])
athleteIDnumber = id_list2[2]
print(athleteIDnumber)
['2967288', ' target']
However, this is where I am stuck: the result is still a list rather than a single value, and I am not sure this is the most efficient way to extract the ID. I also struggled to get regex functions to work for this.
Any advice or support would be appreciated. Thanks Chris
Upvotes: 1
Views: 201
Reputation: 14273
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

spam = '<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>'

def get_athlete_number(html):
    soup = BeautifulSoup(html, 'html.parser')
    href = soup.find('a').get('href')
    return parse_qs(urlparse(href).query).get('athleteNumber', [None])[0]

print(get_athlete_number(spam))
output
25709
Use bs4 to parse the HTML and get the URL. Use urllib.parse from the standard library to parse the URL. Define a function and apply it to the column with the HTML values. Note that the function returns a str (or None if the athleteNumber parameter is missing).
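Since the question also mentions regex, here is a minimal standard-library sketch of that alternative. It is not the approach above, just a lighter variant, and it assumes the ID is always the run of digits directly after "athleteNumber=". The names ATHLETE_ID_RE, extract_athlete_id and athlete_urls are hypothetical, and the second sample string is made up to show the no-match case.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the ID is the digits right after "athleteNumber=".
ATHLETE_ID_RE = re.compile(r'athleteNumber=(\d+)')

def extract_athlete_id(html: str) -> Optional[str]:
    """Return the digits after 'athleteNumber=' as a str, or None if absent."""
    match = ATHLETE_ID_RE.search(html)
    return match.group(1) if match else None

# Plain list standing in for the df2['athleteURL'] column from the question.
# The second entry is invented to illustrate a link without an ID.
athlete_urls = [
    '<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>',
    '<a href="../athletehistory/" target="_top">No number here</a>',
]

athlete_ids = [extract_athlete_id(url) for url in athlete_urls]
print(athlete_ids)  # ['25709', None]

# With pandas, the same function could be applied to the column, e.g.:
# df2['athleteID'] = df2['athleteURL'].apply(extract_athlete_id)
```

The regex is shorter but more brittle than parsing the HTML and query string: it would also match "athleteNumber=" appearing anywhere else in the text, so the bs4 version is the safer choice if the markup varies.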
Upvotes: 1