lanthica

Reputation: 250

Matching urls in html link elements using regex

I'm trying to extract URLs from the href attributes of HTML tags, both tags that have a closing tag (such as <a>...</a>) and tags that are self-closing or left unclosed (such as <link ... />).

Here is the regex:

<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?

Here is some sample data:

<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>

Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>

Putting this into http://re-try.appspot.com/ or http://www.regexplanet.com/advanced/java/index.html (yes, I know it's for Java) yields precisely what I am trying to get: the tag, the href text, the enclosed text with the end tag, and the enclosed text by itself.

However, when I use this in my Python app, the last two groups (enclosed text with the end tag, and enclosed text by itself) are always None. I suspect it has something to do with the group within a group with a backreference: ((.+?)</\1>)?

Also, I should mention that I specifically use:
    matcher = re.compile(...)
    matcher.findall(data)

but the same groups are also None with both matcher.search(data) and matcher.match(data).

Any help would be greatly appreciated!

Upvotes: 0

Views: 154

Answers (2)

eyquem

Reputation: 27585

pat = ('<'
       '(\w+)\s[^<>]*?'
       'href='
       '([\'"])'
       '([\w$-_.+!*\'(\),%/:#=?~[\]!&@;]*?)'
       '(?:\\2)'
       '.*?'
       '>'
       '((.+?)</\\1>)?')

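You just needed to write \\1 instead of \1, or to use a raw string r'...' as DSM did. In a normal string literal, '\1' is the character with code 1, not a backreference to group 1.

Note that I made minor modifications to your pattern:
there were two ! in the character class (one is enough)
I wrote [\] instead of \[\], because inside a character class a [ that follows the opening [ is an ordinary character
the same goes for (\)

Note that I also captured the opening quote in its own group ([\'"]) and put (?:\\2) after the URL, so the closing quote must be the same character as the opening one. This adds one group to the result: the URL is now group 3, and the enclosed text with the end tag and the enclosed text by itself are groups 4 and 5.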
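To illustrate why capturing the opening quote helps, here is a minimal sketch. The names loose and strict, the shortened patterns (only the href part, with .*? instead of the full URL character class), and the sample tag are assumptions made for the example, not part of the original pattern.

```python
import re

# Hypothetical, shortened patterns: only the href part is matched here.
# loose: any quote character may close the value, as in the question.
loose = re.compile(r'href=["\'](.*?)["\']')
# strict: group 1 captures the opening quote, \1 demands the same one to close.
strict = re.compile(r'href=(["\'])(.*?)\1')

# Made-up sample: a double-quoted value that contains a single quote.
sample = '<a href="it\'s.html">x</a>'

# The loose pattern stops at the first quote of either kind.
print(loose.search(sample).group(1))   # it

# The strict pattern only stops at the matching double quote.
print(strict.search(sample).group(2))  # it's.html
```

With the loose form, a value quoted with " is cut short at the first ' inside it; the backreference avoids that, at the price of shifting the group numbers by one.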

Upvotes: 1

DSM

Reputation: 353569

Respectfully, what you want to do is very silly, and you shouldn't do it.

That said, it seems to work for me (by which I mean gives non-None results):

>>> reg = r'<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
... 
>>> d = """
<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>
Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>
"""
>>> 
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''), 
('a', 'http://blah.net/message/new/', 'Click here and submit your updated information </a>', 'Click here and submit your updated information ')]

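My guess is that you forgot to use a raw string when building the regular expression. In a normal string literal, '\1' is the single character chr(1) rather than a backreference to group 1, so the optional last part can never match and is skipped, i.e.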

>>> reg = '<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
... 
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''), 
('a', 'http://blah.net/message/new/', '', '')]
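The difference can be checked directly. The following is a small sketch, not the exact pattern from the question: the URL character class is simplified to [^"\']* and the names prefix, with_backref, with_chr1 and html are made up for the example. Only the last part of the pattern changes between the two compiled versions.

```python
import re

# In a normal string literal '\1' is an octal escape: one character, chr(1).
print('\1' == '\x01', len('\1'))    # True 1
# In a raw string the backslash is kept, so re sees a backreference.
print(r'\1' == '\\1', len(r'\1'))   # True 2

# Simplified prefix (assumption: the URL is "anything up to the next quote").
prefix = r'<(\w+)\s[^<>]*?href=["\']([^"\']*)["\'].*?>'

with_backref = re.compile(prefix + r'((.+?)</\1>)?')  # raw: real backreference
with_chr1 = re.compile(prefix + '((.+?)</\1>)?')      # non-raw: literal chr(1)

html = "<a href='http://blah.net/message/new/'>Click here</a>"

print(with_backref.findall(html))
# [('a', 'http://blah.net/message/new/', 'Click here</a>', 'Click here')]

print(with_chr1.findall(html))
# [('a', 'http://blah.net/message/new/', '', '')]

# findall() shows unmatched groups as '', search() reports them as None.
print(with_chr1.search(html).group(3))  # None
```

Because the last group is optional, the non-raw version still matches the tag and the URL; it just silently skips the part that was supposed to capture the enclosed text.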

Upvotes: 1
