SteveC

Reputation: 408

Getting all instances of a regular expression in Python with re.findall

I'm trying to get the innerHTML of every link using the following code:

import re

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
match = re.findall(r'<a.*>(.*)</a>', s)

for string in match:
    print(string)

But I'm only getting the last occurrence, "Go to page 4". I think it's seeing one big string with several matching regexes inside, which are treated as overlapping and ignored. So, how do I get a collection that matches:

['Go to page 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

Upvotes: 1

Views: 455

Answers (3)

Jon Clements

Reputation: 142206

Your immediate problem is that regexes are greedy by default, that is, they will attempt to consume the longest string possible. So you're correct that it's matching up to the last </a> it can find. Change it to be non-greedy (.*?):

match = re.findall(r'<a.*?>(.*?)</a>', s)
                             ^
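
With that one-character change, findall returns each link's text (running it against the string from the question):

match = re.findall(r'<a.*?>(.*?)</a>', s)
print(match)
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']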

However, this is a horrible way to parse HTML: it's not robust and will break on the smallest of changes.

Here's a far better way of doing it:

from bs4 import BeautifulSoup

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
soup = BeautifulSoup(s, 'html.parser')
print([el.string for el in soup('a')])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

You can then use the same object to get the href as well as the text, e.g.:

print([[el.string, el['href']] for el in soup('a', href=True)])
# [['Go to 1', 'page1.html'], ['Go to page 2', 'page2.html'], ['Go to page 3', 'page3.html'], ['Go to page 4', 'page4.html']]

Upvotes: 2

FastTurtle

Reputation: 2311

I would avoid parsing HTML with regex at ALL costs. Check out this article and this SO post for why. But to sum it up...

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp

Instead, I would take a look at a Python HTML parsing package like BeautifulSoup or pyquery. They provide nice interfaces for traversing, retrieving, and editing HTML.
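
For instance, a minimal pyquery sketch (using the question's HTML string; the selector syntax follows jQuery):

from pyquery import PyQuery

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a></div>'
d = PyQuery(s)
for a in d('a').items():  # CSS selector matching every <a> element
    print(a.text(), a.attr('href'))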

Upvotes: 2

Patrick the Cat

Reputation: 2158

I suggest using lxml:

from lxml import etree

s = '<div><a href="page1.html">Go to 1</a><a href="page2.html">Go to page 2</a></div>'
tree = etree.fromstring(s)
for ele in tree.iter('a'):  # or iter('*') to visit every element
    print(ele.text)

It provides an iterparse() function for processing large files, and it also accepts file-like objects, such as the response returned by urllib2.urlopen(). I have been using it for a long time to parse HTML and XML.
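
As a rough sketch of iterparse (here a BytesIO object stands in for a file-like source such as a urllib2 response):

from io import BytesIO
from lxml import etree

xml = b'<div><a href="page1.html">Go to 1</a><a href="page2.html">Go to page 2</a></div>'
# iterparse streams parse events instead of building the whole tree up front
for event, elem in etree.iterparse(BytesIO(xml), events=('end',), tag='a'):
    print(elem.text, elem.get('href'))
    elem.clear()  # release parsed elements to keep memory flat on large inputs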

See: http://lxml.de/tutorial.html#the-element-class

Upvotes: 1
