Reputation: 3039
I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.
Let's say I have a page with the following stored in the variable HTMLtext:
<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>
I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I'm able to use regex to search for one result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression makes pages = ['home'].
How can I get the regex search to continue for the whole text, appending the matched text to this tuple?
(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)
Upvotes: 5
Views: 3026
Reputation: 13410
Use the findall function of the re module:
pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)
Output:
['home', 'about', 'music', 'photos', 'stuff', 'contact']
Upvotes: 2
Reputation: 80405
Your pattern won’t work on all inputs, including yours. The .* is going to be too greedy (technically, it finds a maximal match), so it matches from the first href to the last corresponding close. The two simplest ways to fix this are to use either a minimal match or a negated character class.
# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)
# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)
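To see the difference between the three patterns, here is a small sketch. The one_line string is a made-up sample (it is not from the question) that puts two links on the same line, which is where the greedy .* visibly overshoots.

```python
import re

# Hypothetical sample: two links on one line (not taken from the question).
one_line = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

# Greedy: .* runs to the last "> on the line, swallowing both links.
greedy = re.findall(r'<a\s+href="/blog/(.*)">', one_line)

# Minimal: .+? stops at the first "> it can reach.
minimal = re.findall(r'<a\s+href="/blog/(.+?)">', one_line)

# Negated character class: [^"]+ cannot cross the closing quote at all.
negated = re.findall(r'<a\s+href="/blog/([^"]+)">', one_line)

print(greedy)   # ['home">Home</a> <a href="/blog/about']
print(minimal)  # ['home', 'about']
print(negated)  # ['home', 'about']
```

The negated character class is usually the safer of the two fixes, since it states directly what the captured text may not contain.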
For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as when there are other attributes between the <a and the href, like <a title="foo" href="bar">; when the case differs, like <A HREF='foo'>; or when single quotes are used, like href='/foo/bar' instead of href="/foo/bar".
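As a rough illustration of how the pattern grows once you allow for those variations, here is one possible sketch. The link_re pattern and the samples string are my own assumptions for demonstration, not something from the question, and the pattern is still far from a real HTML parser.

```python
import re

# Assumed pattern: tolerates extra attributes before href, any letter case,
# and either quote style. The backreference \1 requires the closing quote
# to be the same character as the opening one.
link_re = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])/blog/(.*?)\1''',
    re.I | re.S,
)

# Hypothetical input covering the variations listed above.
samples = '''
<a title="foo" href="/blog/home">Home</a>
<A HREF='/blog/about'>About</A>
<a href="/blog/music">Music</a>
'''

# Group 1 is the quote character, so the page name is group 2.
pages = [m.group(2) for m in link_re.finditer(samples)]
print(pages)  # ['home', 'about', 'music']
```

Unquoted attribute values, comments, and links split across unusual markup would still slip past this, which is the point about complexity.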
That’s not an exhaustive list of concerns; there are others. So using regexes on HTML is possible, but whether it’s expedient depends on too many other factors to judge.
However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.
Upvotes: 2
Reputation: 31653
To find all results, use findall(). Also, you need to compile the re only once, and then you can reuse it.
href_re = re.compile('<a href="/blog/(.*)">') # Compile the regexp once
pages = href_re.findall(HTMLtext) # Find all matches - ["home", "about",
Upvotes: 1
Reputation: 226376
The re.findall() function and the re.finditer() function are used to find multiple matches.
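A minimal sketch of both calls side by side, assuming a shortened copy of the question's markup as input and a negated character class in the pattern (both choices are mine, for illustration):

```python
import re

# Shortened copy of the question's markup, used here as sample input.
HTMLtext = '''<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
</ul>'''

href_re = re.compile(r'<a href="/blog/([^"]*)">')

# findall returns the captured groups directly as a list of strings.
print(href_re.findall(HTMLtext))  # ['home', 'about']

# finditer yields match objects one at a time, so extra details such as
# the position of each capture are available as well.
for m in href_re.finditer(HTMLtext):
    print(m.group(1), m.start(1))
```

findall is the shorter choice when only the captured text is needed; finditer is handy when the match positions matter or the input is large.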
Upvotes: 1
Reputation: 122376
Use findall instead of search:
>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']
Upvotes: 1