Matching urls in html link elements using regex

Question

I'm trying to extract URLs from href attributes, matching both tags that have a closing tag and tags that are left open/unclosed.

Here is the regex:

<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\',%/:#=?~!&@;]*?)[\'"].*?>((.+?)</\1>)?

Here is some sample data:


Click here and submit your updated information  

Thanking you in advance for your attention to this matter.



Regards, 

Debbi Hamilton

Putting this into http://re-try.appspot.com/ or http://www.regexplanet.com/advanced/java/index.html (yes, I know it's for Java) yields precisely what I am trying to get: the tag, the href text, the enclosed text with the end tag, and the enclosed text by itself.

That said, when I use this in my Python app, the last two groups (enclosed text with the end tag, and enclosed text by itself) are always None. I suspect it has something to do with the group within a group with a back reference: ((.+?)</\1>)?

Also, I should mention that I specifically use:
    matcher = re.compile(...)
    matcher.findall(data)

but the groups are also None with both matcher.search(data) and matcher.match(data).

Any help would be greatly appreciated!

DSM · Accepted Answer

Respectfully, what you want to do is very silly, and you shouldn't do it.
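
The answer does not say what to use instead. One common alternative is an HTML parser, and the following is a minimal sketch using the standard library's html.parser. The class name HrefCollector, the helper extract_links, the VOID_TAGS set, and the sample markup are all assumptions made for illustration, not part of the original question.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag and can carry an href (illustrative subset).
VOID_TAGS = {"link", "area", "base"}


class HrefCollector(HTMLParser):
    """Collect (tag, href, enclosed_text) for every tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []   # finished (tag, href, text) tuples
        self._open = []   # stack of [tag, href, text_parts] awaiting a close tag

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag in VOID_TAGS:
            # Unclosed tags have no enclosed text, so record them right away.
            self.links.append((tag, href, ""))
        else:
            self._open.append([tag, href, []])

    def handle_data(self, data):
        # Text belongs to the innermost open tag that had an href, if any.
        if self._open:
            self._open[-1][2].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            name, href, parts = self._open.pop()
            self.links.append((name, href, "".join(parts)))


def extract_links(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, not the question's real data.
SAMPLE = (
    '<link rel="stylesheet" href="http://blah.net/style.css">\n'
    '<a href="http://blah.net/message/new/">Click here</a>\n'
)
links = extract_links(SAMPLE)
```

Here links holds one tuple per tag with an href. Tags left open at the end of the input are simply dropped in this sketch.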

That said, it seems to work for me (by which I mean gives non-None results):

>>> import re
>>> reg = r'<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\',%/:#=?~!&@;]*?)[\'"].*?>((.+?)</\1>)?'
>>> d = """

Click here and submit your updated information  

Thanking you in advance for your attention to this matter.


Regards, 

Debbi Hamilton



"""
>>> 
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''),
('a', 'http://blah.net/message/new/', 'Click here and submit your updated information </a>', 'Click here and submit your updated information ')]

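My guess is that you forgot to use a raw string when making the regular expression. Without the r prefix, Python turns the backreference \1 into the character \x01 before re ever sees it, so the optional closing-tag group can never match and comes back empty: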

>>> reg = '<(\w+)\s[^<>]*?href=[\'"]([\w$-_.+!*\',%/:#=?~!&@;]*?)[\'"].*?>((.+?)</\1>)?'
>>> re.findall(reg, d)
[('link', 'http://blah.net/message/new/?stopemails.aspx?id=5A42FDF5', '', ''), 
('a', 'http://blah.net/message/new/', '', '')]
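
To make the raw-string point concrete, here is a small sketch. It assumes the pattern ends with a backreference to the tag name (</\1>), uses a simplified URL character class, and runs on made-up sample markup; the names RAW, NON_RAW and SAMPLE are illustrative only. The non-raw pattern is built with str.replace to mimic what Python does to '\1' in an ordinary string literal.

```python
import re

# In a normal string literal, '\1' is an octal escape for the character \x01.
assert '\1' == '\x01'
# In a raw string, the backslash and the digit both survive.
assert len(r'\1') == 2

# Hypothetical sample markup, not the question's real data.
SAMPLE = (
    '<link rel="stylesheet" href="http://blah.net/style.css">\n'
    '<a href="http://blah.net/message/new/">Click here</a>\n'
)

# Simplified version of the question's pattern (assumed URL class: anything but quotes).
RAW = r'<(\w+)\s[^<>]*?href=[\'"]([^\'"]*)[\'"].*?>((.+?)</\1>)?'

# What the regex engine receives when the r prefix is forgotten:
# the backreference has already become the literal character \x01.
NON_RAW = RAW.replace('\\1', '\x01')

raw_result = re.findall(RAW, SAMPLE)
non_raw_result = re.findall(NON_RAW, SAMPLE)

print(raw_result)      # last two groups are filled in for the <a> tag
print(non_raw_result)  # last two groups are empty for every tag
```

With the raw pattern, the optional group matches up to the closing tag; with the non-raw one it would need a literal \x01 character in the data, so it is always skipped.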
