Reputation: 80336
I have an HTML file that contains a line:
a = '<li><a href="?id=11&sort=&indeks=0,3" class="">H</a></li>'
When I search:
re.findall(r'href="?(\S+)"', a)
I get expected output:
['?id=11&sort=&indeks=0,3']
However, when I add "i" to the pattern like:
re.findall(r'href="?i(\S+)"', a)
I get:
[]
Where's the catch? Thank you in advance.
Upvotes: 2
Views: 486
Reputation: 7867
The catch here is that ? has a special meaning in regexes: it makes the preceding element optional (zero or one occurrence). So, if you want the href value from the <a> tag, you should use:
re.findall(r'href="(\?\S+)"', a)
and not
re.findall(r'href="?(\S+)"', a)
So, if you don't want ?'s special meaning, you should escape it as \?. Left unescaped, as in ab?, it makes the preceding b optional, so that pattern matches either a or ab. In your pattern the ? applies to the " before it, which is not what you intended.
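As a small illustration of the difference between the quantifier and the escaped literal, here is a sketch using made-up sample strings (the strings and variable names are not from the question):

```python
import re

# Unescaped, ? makes the preceding element optional:
# 'ab?' means 'a' followed by an optional 'b'.
optional_b = re.findall(r'ab?', 'a ab abb')

# Escaped, \? matches a literal question mark.
literal_q = re.findall(r'a\?', 'a? a')

# re.escape produces the escaped form of arbitrary literal text,
# so it can be joined safely with the rest of a pattern.
id_value = re.findall(re.escape('?id=') + r'(\d+)', 'href="?id=11&sort="')

print(optional_b)  # ['a', 'ab', 'ab']
print(literal_q)   # ['a?']
print(id_value)    # ['11']
```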
Upvotes: 0
Reputation: 150947
I personally think that Python's built-in HTMLParser is incredibly useful for cases like these. I don't think this is overkill at all -- I think it's vastly more readable and maintainable than a regex.
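>>> from html.parser import HTMLParser  # Python 3; in Python 2 the module is HTMLParser
>>> class HrefExtractor(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         # attrs arrives as a list of (name, value) pairs
...         if tag == 'a':
...             attrs = dict(attrs)
...             if 'href' in attrs:
...                 print(attrs['href'])
...
>>> he = HrefExtractor()
>>> he.feed('<a href=foofoofoo>')
foofoofoo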
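Applied to the line from the question, the same idea might look like the sketch below. It assumes Python 3, and the HrefCollector class is a hypothetical variant that collects the values in a list (like re.findall does) instead of printing them:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


a = '<li><a href="?id=11&sort=&indeks=0,3" class="">H</a></li>'
collector = HrefCollector()
collector.feed(a)
print(collector.hrefs)  # ['?id=11&sort=&indeks=0,3']
```

No escaping of ? is needed here, because the parser treats the attribute value as plain text rather than as a pattern.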
Upvotes: 4
Reputation: 500157
The problem is that the ? has a special meaning and is not being matched literally.
To fix, change your regex like so:
re.findall(r'href="\?i(\S+)"', a)
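Otherwise, the ? is treated as the optional modifier applied to the ". This happens to work (by accident) in your first example, because the optional " matches and the literal ? is then picked up by \S+, but it doesn't work in the second, where the pattern expects an i right after the quote.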
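To see all three behaviours side by side, here is a short sketch that runs the patterns from the question and the fixed pattern against the same line (the variable names are only for illustration):

```python
import re

a = '<li><a href="?id=11&sort=&indeks=0,3" class="">H</a></li>'

# "? is an optional quote; the literal ? in the text is consumed by \S+.
accidental = re.findall(r'href="?(\S+)"', a)

# After the optional quote the pattern needs an 'i', but the text has '?'.
broken = re.findall(r'href="?i(\S+)"', a)

# \? matches the literal question mark, so 'i' lines up with 'id'.
fixed = re.findall(r'href="\?i(\S+)"', a)

print(accidental)  # ['?id=11&sort=&indeks=0,3']
print(broken)      # []
print(fixed)       # ['d=11&sort=&indeks=0,3']
```

Note that the fixed pattern captures only what follows the i, since the i sits outside the group.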
Upvotes: 4