gcc72

Reputation: 11

Extract part of a URL stored in a list within a dataframe - Python

I am trying to extract only the numeric part (25709 in the example below) and store it in a variable, let's call it athleteID, that I can later insert into a dynamic URL to iterate through and send search requests:

'<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>'

I have a list of these URLs (or partial URLs) stored in a dataframe column. I have iterated over it twice using split() and managed to get it to the point below.

 id_list = []
 for id in df2['athleteURL']:
     # split each anchor tag on '=' (no backslash needed)
     i = id.split('=')
     id_list.append(i)
 print(id_list)

This produces a list of lists; one row is shown below as an example:

 '<a href', '"../athletehistory/?athleteNumber', '25709" target', '"_top">Zola Budd</a>'

I then did a second iteration using split('"') and got it to the below:

 id_list2=[]


 for id2 in id_list[2]:
     j = id2.split('"')
     id_list2.append(j)

 #print(id_list2[2])

 athleteIDnumber = id_list2[2]
 print(athleteIDnumber)

 ['2967288', ' target']

However, this is where I am now stuck: the ID is still one element within a list. I am also not sure this is the most efficient way to extract it, as I struggled with regex functions too.

Any advice or support would be appreciated. Thanks Chris

Upvotes: 1

Views: 201

Answers (1)

buran

Reputation: 14273

from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

spam = '<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>'

def get_athlete_number(html):
    soup = BeautifulSoup(html, 'html.parser')
    href = soup.find('a').get('href')
    return parse_qs(urlparse(href).query).get('athleteNumber', [None])[0]

print(get_athlete_number(spam))

output

25709

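Use bs4 to parse the HTML and get the URL from the href attribute. Use urllib.parse from the standard library to parse the URL and read its query string. Define a function and apply it to the column with the HTML values. Note that the function returns a str, or None if the URL has no athleteNumber parameter.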
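
Since the question also mentions regex, here is a minimal standard-library sketch of that alternative. It is not the bs4 approach above: it assumes the ID is always a run of digits directly after "athleteNumber=". The names extract_athlete_number and athlete_urls are made up for illustration, and the plain list stands in for the df2['athleteURL'] column.

```python
import re

# Assumption: the ID is always a run of digits right after "athleteNumber=".
ATHLETE_NUMBER = re.compile(r'athleteNumber=(\d+)')

def extract_athlete_number(html):
    """Return the digits after 'athleteNumber=' as a str, or None if absent."""
    match = ATHLETE_NUMBER.search(html)
    return match.group(1) if match else None

# Hypothetical stand-in for the df2['athleteURL'] column from the question.
athlete_urls = [
    '<a href="../athletehistory/?athleteNumber=25709" target="_top">Zola Budd</a>',
    '<a href="../athletehistory/" target="_top">No number</a>',
]

# One ID (or None) per row, in the same order as the input.
athlete_ids = [extract_athlete_number(html) for html in athlete_urls]
print(athlete_ids)  # ['25709', None]
```

With pandas, either function should work on the real column through df2['athleteURL'].apply(...). The regex is shorter but more brittle than parsing the HTML, so it is probably only worth using if the links always have this exact shape.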

Upvotes: 1
