Reputation: 250
I'm trying to extract URLs from tags that have an href attribute, matching both tags that have a closing tag and tags that are left open/unclosed.
Here is the regex:
<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?
Here is some sample data:
<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>
Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>
Putting this into http://re-try.appspot.com/ or http://www.regexplanet.com/advanced/java/index.html (yes, I know it's for Java) yields precisely what I am trying to get: the tag, the href text, the enclosed text with the end tag, and the enclosed text by itself.
That said, when I use this in my Python app, the last two groups (enclosed text with the end tag, and enclosed text by itself) are always None. I suspect it has something to do with the group within a group with a back reference: ((.+?)</\1>)?
Also, I should mention that I specifically use matcher = re.compile(...) followed by matcher.findall(data), but the groups are also None with both matcher.search(data) and matcher.match(data).
Any help would be greatly appreciated!
Upvotes: 0
Views: 154
Reputation: 27585
pat = ('<'
'(\w+)\s[^<>]*?'
'href='
'([\'"])'
'([\w$-_.+!*\'(\),%/:#=?~[\]!&@;]*?)'
'(?:\\2)'
'.*?'
'>'
'((.+?)</\\1>)?')
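You just needed to write \\1, or to use a raw string r'...' as DSM did. In an ordinary string literal, '\1' is the single control character chr(1), so the regex engine never sees a backreference.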
Note that I made minor modifications to your pattern:
there were two ! in the character class;
I wrote [\] instead of \[\], because inside a character class a [ that comes after the opening [ is treated as a plain character;
the same goes for (\) instead of \(\).
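Note also that I captured the opening quote in its own group, ([\'"]), and put (?:\\2) after the URL so that the closing quote must be the same character as the opening one. As a result, the URL is now group 3 instead of group 2.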
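To see why the backreference silently fails, it can help to compare the three spellings directly. The following is only a small sketch: the shortened pattern and the sample string are made up for illustration and are not the full regex from the question.

```python
import re

# In a normal string literal, '\1' is the single control character chr(1),
# so no backreference ever reaches the regex engine.
plain = '\1'
escaped = '\\1'   # a backslash followed by '1'
raw = r'\1'       # the same two characters, written as a raw string

print(len(plain), len(escaped), len(raw))  # 1 2 2
print(escaped == raw)                      # True

# Shortened, hypothetical pattern and input used only for this demonstration.
html = "<a href='x'>text</a>"
good = re.search(r'<(\w+)[^>]*>((.+?)</\1>)?', html)
bad = re.search('<(\\w+)[^>]*>((.+?)</\1>)?', html)

print(good.groups())  # ('a', 'text</a>', 'text')
print(bad.groups())   # ('a', None, None)
```

With the unescaped '\1', the optional group can never match, which is why search() reports None for the last two groups.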
Upvotes: 1
Reputation: 353569
Respectfully, what you want to do is very silly, and you shouldn't do it.
That said, it seems to work for me (by which I mean gives non-None results):
>>> import re
>>> reg = r'<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
>>> d = """
<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' /><table><tr><td>
<a href='http://blah.net/message/new/'>Click here and submit your updated information </a> <br><br>Thanking you in advance for your attention to this matter.<br><br>
Regards, <br>
Debbi Hamilton
</td></tr><tr><td><br><br></td></tr></table>
"""
>>>
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''),
('a', 'http://blah.net/message/new/', 'Click here and submit your updated information </a>', 'Click here and submit your updated information ')]
My guess is that you forgot to use a raw string when making the regular expression, so that '\1' was read as the character chr(1) rather than as a backreference, i.e.
>>> reg = '<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\'\(\),%\/:#=?~\[\]!&@;]*?)[\'"].*?>((.+?)</\1>)?'
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''),
('a', 'http://blah.net/message/new/', '', '')]
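As for the alternative to a regex: one option is to let an HTML parser find the tags and read the href attributes from them. Below is a minimal sketch, assuming Python 3 and the standard-library html.parser module; the class name HrefCollector and the shortened sample string are made up for illustration. It only collects (tag, href) pairs and does not try to capture the enclosed text.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect (tag, href) pairs for every start tag that carries an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        # Self-closing tags such as <link ... /> also end up here by default.
        for name, value in attrs:
            if name == 'href':
                self.links.append((tag, value))


# Shortened version of the sample data from the question.
sample = (
    "<link href='http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5' />"
    "<a href='http://blah.net/message/new/'>Click here</a>"
)

collector = HrefCollector()
collector.feed(sample)
print(collector.links)
```

Collecting the enclosed text as well would mean overriding handle_data and handle_endtag and tracking which tag is currently open, which is left out here to keep the sketch short.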
Upvotes: 1