Dustin

Reputation: 6307

Python - Parse String

I'm having a really annoying problem, the answer is probably very simple yet I can't put 2 and 2 together...

I have an example of a string that'll look something like this:

<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>

The number 38903 will be different every time I load the page, so I need a way to extract it on each load. I've gotten far enough to grab and store the piece of HTML above, but I can't pull out just the number.

Again, probably a really easy thing to do, just can't figure it out. Thanks in advance!

Upvotes: 0

Views: 1237

Answers (3)

BrainCore

Reputation: 5422

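import re
# s is the HTML string you have already grabbed
r = re.compile(r'viewsite\((\d+)\)')  # raw string so \( and \d reach the regex engine intact
r.findall(s)  # list of the digit strings captured inside viewsite(...)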

This looks specifically for the all-digit argument to viewsite(). You may prefer it to Andrew's answer: his approach keeps every digit in the string, so if other digits show up anywhere in the HTML you will start getting incorrect results.
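A small sketch of the difference, using the tag from the question. The extra tabindex attribute is a made-up addition for illustration only; it stands in for any other digits that might appear in the markup.

```python
import re

# The tag from the question.
html = '<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'

# Hypothetical variant with one more digit elsewhere in the markup (assumption, not from the question).
html_extra = html.replace('class="followbutton"', 'class="followbutton" tabindex="2"')

pattern = re.compile(r'viewsite\((\d+)\)')

# Anchored pattern: only the digits inside viewsite(...) are captured.
anchored = pattern.findall(html_extra)

# Strip-all-non-digits approach: the stray "2" ends up glued onto the result.
stripped = re.sub(r'\D', '', html_extra)[1:]

print(anchored)  # ['38903']
print(stripped)  # 389032
```

The anchored pattern keeps returning the right id, while the strip-everything approach silently returns a wrong number.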

Upvotes: 1

mshsayem

Reputation: 18038

>>> import re
>>> grabbed_html = '''<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'''
>>> re.findall(r'viewsite\((\d+)\);', grabbed_html)[0]
'38903'

Upvotes: 0

Andrew Gorcester

Reputation: 19983

If you're using BeautifulSoup it is dead simple to get just the onclick string, which will make this easier. But here's a really crude way to do it:

import re
result = re.sub(r"\D", "", html_string)[1:]

\D matches any non-digit character, so this removes everything in the string that isn't a digit. The slice then drops the leading "0" that comes from javascript:void(0).

Other options: use re.findall to grab every series of digits and take the second match. Or use re.search to match a series of digits after a substring, where the substring is <a href="javascript:void(0);" onclick="viewsite(.
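Here is a rough sketch of both options. The function name extract_site_id is made up for this example, and it anchors on a shorter substring (onclick="viewsite() than the one above, on the assumption that the href part may vary.

```python
import re

html_string = '<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'

# Option 1: every run of digits, then take the second one (the first is the 0 in void(0)).
second_run = re.findall(r'\d+', html_string)[1]

# Option 2: digits that directly follow a known substring.
def extract_site_id(html_string):
    """Return the viewsite() argument as an int, or None if the tag doesn't match."""
    match = re.search(r'onclick="viewsite\((\d+)\)', html_string)
    return int(match.group(1)) if match else None

print(second_run)                    # 38903
print(extract_site_id(html_string))  # 38903
```

Option 2 returns None instead of raising when the page doesn't contain the tag, which is easier to handle if the markup changes.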

Edit: It sounds like you are using BeautifulSoup. In that case, presumably you have an object which represents the a tag. Let's assume that object is named a:

import re
result = re.sub(r"\D", "", a['onclick'])

Upvotes: 1
