Reputation: 6307
I'm having a really annoying problem, the answer is probably very simple yet I can't put 2 and 2 together...
I have an example of a string that'll look something like this:
<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>
The number 38903 will be different every time I load the page, so I need a way to parse it out each time. I've gotten far enough to grab the piece of HTML above, but I can't extract just the number.
Again, probably a really easy thing to do, just can't figure it out. Thanks in advance!
Upvotes: 0
Views: 1237
Reputation: 5422
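import re
# Raw string so the backslashes reach the regex engine unchanged
r = re.compile(r'viewsite\((\d+)\)')
r.findall(s)  # s is the HTML string you grabbed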
This looks specifically for the all-digit argument to viewsite(). You may prefer this to Andrew's answer, since stripping every non-digit starts giving incorrect results as soon as other digits show up in the HTML string.
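To illustrate the difference, here is a minimal sketch. The extra tabindex attribute and the extract_site_id helper are assumptions made up for this example, not part of the original question:

```python
import re

# Hypothetical sample: the question's tag plus an extra digit-bearing attribute.
html = ('<a href="javascript:void(0);" onclick="viewsite(38903);" '
        'class="followbutton" tabindex="7">Visit</a>')

def extract_site_id(html_string):
    """Return the numeric argument of viewsite() as a string, or None if absent."""
    match = re.search(r'viewsite\((\d+)\)', html_string)
    return match.group(1) if match else None

site_id = extract_site_id(html)       # anchored on viewsite(...)
all_digits = re.sub(r'\D', '', html)  # strips every non-digit instead

print(site_id)     # 38903
print(all_digits)  # 0389037
```

The anchored pattern ignores the stray digits, while the strip-everything approach glues them all together.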
Upvotes: 1
Reputation: 18038
>>> import re
>>> grabbed_html = '''<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'''
>>> re.findall(r'viewsite\((\d+)\);', grabbed_html)[0]
'38903'
Upvotes: 0
Reputation: 19983
If you're using BeautifulSoup, it is dead simple to get just the onclick string, which will make this easier. But here's a really crude way to do it:
import re
result = re.sub(r"\D", "", html_string)[1:]
\D matches all non-digits, so this will remove everything in the string that isn't a digit. Then take a slice to get rid of the "0" from javascript:void(0).
Other options: use re.findall to grab every series of digits and take the second match. Or use re.search with a pattern that anchors on the substring <a href="javascript:void(0);" onclick="viewsite( and captures the digits that follow it.
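A rough sketch of those two alternatives, assuming html_string holds exactly the tag from the question (the variable names here are illustrative):

```python
import re

html_string = ('<a href="javascript:void(0);" onclick="viewsite(38903);" '
               'class="followbutton">Visit</a>')

# Option 1: collect every run of digits; the first run is the 0 from void(0),
# so the second run is the viewsite() argument.
digit_runs = re.findall(r'\d+', html_string)
second_run = digit_runs[1]

# Option 2: only accept digits that directly follow a known literal prefix.
# re.escape keeps the parentheses and other punctuation from being read as regex syntax.
prefix = re.escape('<a href="javascript:void(0);" onclick="viewsite(')
match = re.search(prefix + r'(\d+)', html_string)
after_prefix = match.group(1) if match else None

print(second_run)    # 38903
print(after_prefix)  # 38903
```

Option 1 depends on the order of the digits in the markup; option 2 depends on the exact prefix text, so either one breaks if the page layout changes.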
Edit: It sounds like you are using BeautifulSoup. In that case, presumably you have an object which represents the a tag. Let's assume that object is named a:
import re
result = re.sub(r"\D", "", a['onclick'])
Upvotes: 1