Reputation: 437
I am scraping a site with Python and get this markup back:
<a href="javascript:document.frmMain.action.value='display_physician_info';document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();" title="For more information, click here.">JOHN, DOE</a>
I want to parse a specific value out of the href, namely the PhysicianID, which is 1234567 in "document.frmMain.PhysicianID.value=1234567".
Currently I'm only getting the whole link, and with it the whole href text, using something like this:
for i in soup.select('.data'):
    name = i.find('a', attrs = {'title': 'For more information, click here.'})
Any idea? Thanks in advance.
Upvotes: 0
Views: 334
Reputation: 22440
Or without regex:
from bs4 import BeautifulSoup
content = """
<a href="javascript:document.frmMain.action.value='display_physician_info';document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();" title="For more information, click here.">JOHN, DOE</a>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.select_one("a")['href'].split("PhysicianID.value=")[1].split(";")[0]
print(item)
Output:
1234567
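The chained split above raises an IndexError when a link does not contain "PhysicianID.value=". A minimal sketch of a more forgiving variant, assuming you would rather get None for such links; the helper name physician_id_from_href and its marker parameter are hypothetical, not part of BeautifulSoup:

```python
def physician_id_from_href(href, marker="PhysicianID.value="):
    """Return the digits that follow `marker` in href, or None if absent.

    Hypothetical helper: plain string methods only, no regex.
    """
    # partition() returns (before, marker, after); `found` is empty
    # when the marker does not occur in the string.
    _, found, tail = href.partition(marker)
    if not found:
        return None
    # The value runs up to the next semicolon in the javascript: URL.
    value = tail.split(";", 1)[0].strip()
    return value if value.isdigit() else None


href = (
    "javascript:document.frmMain.action.value='display_physician_info';"
    "document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();"
)
print(physician_id_from_href(href))                  # 1234567
print(physician_id_from_href("javascript:void(0)"))  # None
```

The href string would come from soup.select_one("a")['href'] as in the snippet above; it is hard-coded here so the sketch runs without bs4 installed.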
Upvotes: 1
Reputation: 11912
Getting the href itself is easy with BeautifulSoup once you've got the link:
href = name['href']
Then you can use a regex with the re module:
import re
# Escape the dots so they match literal '.' characters only
match = re.search(r'document\.frmMain\.PhysicianID\.value=\d+;', href).group()
value = re.search(r'\d+', match).group()
print(value)  # prints 1234567
Putting it all together with your code:
import re
for i in soup.select('.data'):
    name = i.find('a', attrs = {'title': 'For more information, click here.'})
    if name is None:
        continue  # no matching link in this element
    href = name['href']
    match = re.search(r'document\.frmMain\.PhysicianID\.value=\d+;', href).group()
    value = re.search(r'\d+', match).group()
    print(value)  # prints 1234567
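The two searches can be folded into one by putting the digits in a capture group. The sketch below also returns None for links that carry no PhysicianID, where calling .group() directly on a failed search would raise an AttributeError. The names PHYSICIAN_ID_RE and extract_physician_id are hypothetical, and the sample hrefs are hard-coded stand-ins for name['href']:

```python
import re

# One pattern with a capture group: group(1) is just the digits.
PHYSICIAN_ID_RE = re.compile(r"PhysicianID\.value=(\d+)")


def extract_physician_id(href):
    """Return the PhysicianID digits from a javascript: href, or None."""
    # `href or ""` tolerates a missing href attribute (None).
    match = PHYSICIAN_ID_RE.search(href or "")
    return match.group(1) if match else None


hrefs = [
    "javascript:document.frmMain.action.value='display_physician_info';"
    "document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();",
    "javascript:void(0)",
    None,
]
ids = [extract_physician_id(h) for h in hrefs]
print(ids)  # ['1234567', None, None]
```

Inside the loop you would call extract_physician_id(name.get('href')) and skip the None results.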
Upvotes: 1