Reputation: 111

Finding more than one occurrence using a regular expression

Is it possible to capture every href value using a single regular expression?

For example:

<div id="w1">
    <ul id="u1">
        <li><a id='1' href='book'>book<sup>1</sup></a></li>
        <li><a id='2' href='book-2'>book<sup>2</sup></a></li>
        <li><a id='3' href='book-3'>book<sup>3</sup></a></li>
    </ul>
</div>

I want to get book, book-2 and book-3.

Upvotes: 0

Answers (3)

HFX

Reputation: 592

Using a custom class that extends HTMLParser:

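from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        # Collects the href value of every <a> tag seen so far
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the current tag
        if tag == 'a':
            for attribute in attrs:
                if attribute[0] == 'href':
                    self.anchorlist.append(attribute[1])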

Once you pass the HTML to the parser's feed() method, every href value ends up in anchorlist.

Note that this is written for Python 3.x.
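
As a quick check, here is a minimal sketch that restates the class and runs it on the HTML from the question. The variable names html and parser are assumptions added for illustration.

```python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.anchorlist = []  # href values, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.anchorlist.append(value)

# Sample markup copied from the question
html = """<div id="w1">
    <ul id="u1">
        <li><a id='1' href='book'>book<sup>1</sup></a></li>
        <li><a id='2' href='book-2'>book<sup>2</sup></a></li>
        <li><a id='3' href='book-3'>book<sup>3</sup></a></li>
    </ul>
</div>"""

parser = MyHTMLParser()
parser.feed(html)          # parsing triggers handle_starttag for each tag
print(parser.anchorlist)   # ['book', 'book-2', 'book-3']
```

Unlike a regular expression, this does not depend on the attribute order or on which quote style the markup uses.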

Upvotes: 0

sshashank124

Reputation: 32189

You can do that with the following regex:

<a id='\d+' href='([\w-]+)'

import re

s = '''<div id="w1"><ul id="u1"><li><a id='1' href='book'>book<sup>1</sup></a></li><li><a id='2' href='book-2'>book<sup>2</sup></a></li><li><a id='3' href='book-3'>book<sup>3</sup></a></li></ul></div>'''

>>> print(re.findall(r"<a id='\d+' href='([\w-]+)'", s))
['book', 'book-2', 'book-3']

Upvotes: 0

Pedro Lobito

Reputation: 98861

Short and simple:

import re

html = '''<div id="w1"><ul id="u1"><li><a id='1' href='book'>book<sup>1</sup></a></li><li><a id='2' href='book-2'>book<sup>2</sup></a></li><li><a id='3' href='book-3'>book<sup>3</sup></a></li></ul></div>'''
result = re.findall(r"href='(.*?)'", html)

EXPLANATION:

Match the character string “href='” literally (case sensitive) «href='»
Match the regex below and capture its match into backreference number 1 «(.*?)»
   Match any single character that is NOT a line break character (line feed) «.»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “'” literally «'»
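
To see why the lazy quantifier matters here, the following sketch compares it with a greedy one on a shortened, single-line version of the markup. The names lazy and greedy and the shortened sample string are assumptions made for illustration.

```python
import re

# Two anchors on one line, so a greedy match can run past the first closing quote
html = "<li><a id='1' href='book'>book</a></li><li><a id='2' href='book-2'>book</a></li>"

# Lazy: stops at the first "'" after each href='
lazy = re.findall(r"href='(.*?)'", html)

# Greedy: runs to the last "'" on the line, swallowing everything in between
greedy = re.findall(r"href='(.*)'", html)

print(lazy)         # ['book', 'book-2']
print(len(greedy))  # 1
```

The greedy version returns a single long string that starts at the first href value and ends at the last one, which is why the answer uses «.*?» rather than «.*».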

Upvotes: 2
