hao_maike

Reputation: 3039

Python: store many regex matches in tuple?

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.

Let's say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I'm able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression gives pages = ['home'].

How can I get the regex search to continue for the whole text, appending the matched text to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

Upvotes: 5

Views: 3026

Answers (5)

ovgolovin

Reputation: 13410

Use the findall function of the re module:

pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)
print(pages)

Output:

['home', 'about', 'music', 'photos', 'stuff', 'contact']
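Note that findall returns a list, not a tuple. If a tuple is really what's wanted, as the question asks, the result can be converted afterwards. A minimal sketch, assuming a shortened two-link stand-in for the question's HTMLtext:

```python
import re

# Shortened stand-in for the question's HTMLtext (assumption: one link per line)
HTMLtext = '''<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
</ul>'''

# findall returns a list of the captured groups, in document order
pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)

# Convert to a tuple only if an immutable sequence is actually needed
pages_tuple = tuple(pages)
print(pages_tuple)
```

For most purposes the list is just as usable; the conversion only matters if something downstream requires a tuple.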

Upvotes: 2

tchrist

Reputation: 80405

Your pattern won’t work on all inputs. The .* is greedy (technically, it finds a maximal match), so whenever several links fall within its reach, for example on the same line, it runs from the first href to the last closing ">. The two simplest ways to fix this are to use either a minimal match or a negated character class.

# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)

# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)
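To see the difference concretely, here is a small sketch comparing the three patterns. The one_line string is a hypothetical input, not from the question, chosen so that two links share a single line:

```python
import re

# Hypothetical input: two links on a single line, unlike the question's sample
one_line = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

# Greedy: .* grabs as much as possible, then backs off to the LAST ">
greedy = re.findall(r'<a href="/blog/(.*)">', one_line)

# Minimal: .*? stops at the FIRST "> it can reach
lazy = re.findall(r'<a href="/blog/(.*?)">', one_line)

# Negated character class: the capture cannot cross a double quote at all
negated = re.findall(r'<a href="/blog/([^"]*)">', one_line)

print(greedy)   # one over-long match spanning both links
print(lazy)     # ['home', 'about']
print(negated)  # ['home', 'about']
```

When every link sits on its own line, the greedy version happens to work, because . does not match a newline unless re.S is given.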

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as

  • if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
  • casing issues like <A HREF='foo'>
  • whitespace issues
  • alternate quotes like href='/foo/bar' instead of href="/foo/bar"
  • embedded HTML comments

That’s not an exhaustive list of concerns; there are others. So using regexes on HTML is possible, but whether it’s expedient depends on too many other factors to judge in general.
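As a rough illustration of how quickly the pattern grows once you account for some of the concerns listed above, here is a sketch that tolerates upper-case tags, either quote style, whitespace (including newlines) inside the tag, and other attributes before the href. The messy input and the names link_re, quote and page are made up for this example, and the pattern still does nothing about HTML comments or attributes such as data-href:

```python
import re

# Hypothetical messy input covering some of the cases listed above
messy = '''<A title="foo" HREF='/blog/home'>Home</A>
<a
   href="/blog/about">About</a>'''

link_re = re.compile(
    # <a, then any other attributes, then href = "..." or '...'
    r'''<a\b[^>]*?\bhref\s*=\s*(?P<quote>["'])/blog/(?P<page>.*?)(?P=quote)''',
    re.I | re.S,
)

# finditer gives match objects, so the named group can be pulled out directly
pages = [m.group('page') for m in link_re.finditer(messy)]
print(pages)  # ['home', 'about']
```

The backreference (?P=quote) makes sure the value closes with the same kind of quote that opened it.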

However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.

Upvotes: 2

Mariusz Jamro

Reputation: 31653

To find all results, use findall(). You also only need to compile the regex once; the compiled pattern can then be reused.

href_re = re.compile(r'<a href="/blog/(.*)">')  # Compile the regexp once

pages = href_re.findall(HTMLtext)  # Find all matches: ["home", "about", ...]

Upvotes: 1

Raymond Hettinger

Reputation: 226376

The re.findall() function and the re.finditer() function are used to find multiple matches.
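A short sketch of the difference between the two, assuming a shortened two-link stand-in for the question's HTMLtext: findall returns the captured strings directly, while finditer yields match objects, which is handy when you also want positions or several groups.

```python
import re

# Shortened stand-in for the question's HTMLtext
HTMLtext = '''<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>'''

pattern = re.compile(r'<a href="/blog/([^"]*)">')

# findall returns only the captured strings
print(pattern.findall(HTMLtext))  # ['music', 'photos']

# finditer yields match objects lazily, so offsets are available too
found = [(m.group(1), m.start()) for m in pattern.finditer(HTMLtext)]
print(found)
```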

Upvotes: 1

Simeon Visser

Reputation: 122376

Use findall instead of search:

>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']

Upvotes: 1
