root

Reputation: 80336

python regular expressions: html

I have an HTML file that contains a line:

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

When I search:

re.findall(r'href="?(\S+)"', a)

I get expected output:

['?id=11&amp;sort=&amp;indeks=0,3']

However, when I add "i" to the pattern like:

re.findall(r'href="?i(\S+)"', a)

I get:

[]

Where's the catch? Thank you in advance.

Upvotes: 2

Views: 486

Answers (3)

theharshest

Reputation: 7867

The catch here is that ? has a special meaning in regexes: it makes the preceding element optional, i.e. it matches zero or one occurrence of it. So, if you want the href value from the <a> tag, you should be using:

re.findall(r'href="(\?\S+)"', a)

and not

re.findall(r'href="?(\S+)"', a)

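So, if you don't want ?'s special meaning, you should escape it as \?. Left unescaped, as in ab?, it makes the b optional, so the pattern matches either a or ab. In your pattern the ? applies to the " instead of matching a literal question mark, which is not what you intended.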
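To make the difference between the quantifier and the escaped form concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not from the question.

```python
import re

# Unescaped: 'b?' makes only the 'b' optional, so 'ab?' matches 'a' or 'ab'.
optional_b = re.findall(r'ab?', 'a ab abb')
print(optional_b)  # ['a', 'ab', 'ab']

# Escaped: '\?' matches a literal question mark.
literal_q = re.findall(r'ab\?', 'ab ab?')
print(literal_q)  # ['ab?']

# re.escape produces the escaped form of a literal string.
escaped = re.escape('?id')
print(escaped)  # \?id
```

re.escape is handy when the literal text comes from a variable rather than being typed into the pattern by hand.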

Upvotes: 0

senderle

Reputation: 150947

I personally think that Python's built-in HTMLParser is incredibly useful for cases like these. I don't think this is overkill at all -- I think it's vastly more readable and maintainable than a regex.

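>>> from html.parser import HTMLParser  # Python 3; in Python 2 the module is named HTMLParser
>>> class HrefExtractor(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         # attrs is a list of (name, value) pairs for the tag
...         if tag == 'a':
...             attrs = dict(attrs)
...             if 'href' in attrs:
...                 print(attrs['href'])
... 
>>> he = HrefExtractor()
>>> he.feed('<a href=foofoofoo>')
foofoofoo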
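As a rough sketch of how the same idea could be applied to the line from the question, the variant below collects the href values in a list instead of printing them. It assumes Python 3 (html.parser), and the class name HrefCollector is a hypothetical helper, not part of the standard library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'
collector = HrefCollector()
collector.feed(a)
print(collector.hrefs)  # ['?id=11&sort=&indeks=0,3']
```

Note that the parser decodes &amp; to & in attribute values, so the result differs from the raw text a regex would capture.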

Upvotes: 4

NPE

Reputation: 500157

The problem is that the ? has a special meaning and is not being matched literally.

To fix, change your regex like so:

re.findall(r'href="\?i(\S+)"', a)

Otherwise, the ? is treated as the optional modifier applied to the ". This happens to work (by accident) in your first example, but doesn't work in the second.
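The three patterns can be compared side by side on the line from the question. This is only an illustrative sketch; the variable names first, second and fixed are chosen here for the demonstration.

```python
import re

a = '<li><a href="?id=11&amp;sort=&amp;indeks=0,3" class="">H</a></li>'

# '"?' means "an optional double quote". The literal '?' in the text is
# then swallowed by \S+, which is why this works by accident.
first = re.findall(r'href="?(\S+)"', a)
print(first)  # ['?id=11&amp;sort=&amp;indeks=0,3']

# Now an 'i' is required right after the optional quote, but the text
# has '?' there, so nothing matches.
second = re.findall(r'href="?i(\S+)"', a)
print(second)  # []

# With the question mark escaped, '?i' is matched literally and the
# group captures what follows it.
fixed = re.findall(r'href="\?i(\S+)"', a)
print(fixed)  # ['d=11&amp;sort=&amp;indeks=0,3']
```

Because the i sits outside the capturing group, the fixed pattern returns the value without its leading ?i. Move the escaped ? and the i inside the parentheses if the whole value is wanted.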

Upvotes: 4
