Reputation: 329

python regex findall

I want to find everything between <span class=""> and </span>:

p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)

For example, given <span class="">foo</span> I expect it to return foo, but it does not. Why does my code go wrong?

Cheers

Upvotes: 0

Answers (2)

Mark Tolonen

Reputation: 177800

Your original code works as is. You should use an HTML parser though.

import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print text

Output:

['foo']

Edit

As Tim points out, re.DOTALL should be used. Without it, . does not match newlines and the example below would fail:

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print text
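
To see what the flag changes, here is a minimal Python 3 sketch (not the code above, which is Python 2). The names pattern_plain and pattern_dotall are made up for illustration, and the pattern is written as a raw string without the unneeded backslash:

```python
import re

# Same pattern compiled twice; only the flags differ.
pattern_plain = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE)
pattern_dotall = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE | re.DOTALL)

z = '<span class=""> a more\ncomplicated foo</span>'

# Without DOTALL, "." stops at the newline, so the closing tag is never reached.
without_flag = pattern_plain.findall(z)
# With DOTALL, "." also matches the newline, so the whole body is captured.
with_flag = pattern_dotall.findall(z)

print(without_flag)  # []
print(with_flag)     # [' a more\ncomplicated foo']
```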

Even then it would fail for nested spans:

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print text

Output (failing):

[' a more\ncomplicated<span class="other">other']

So use an HTML parser like BeautifulSoup:

from BeautifulSoup import BeautifulSoup
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z)
# Outer span, matched by its empty class attribute
print soup.findAll('span',{'class':''})
print
# Nested span, matched by class "other"
print soup.findAll('span',{'class':'other'})

Output:

[<span class=""> a more
complicated<span class="other">other</span>foo</span>]

[<span class="other">other</span>]
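
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser in Python 3. This is only an illustration, not a replacement for BeautifulSoup: the class name SpanCollector and its attributes are made up here, and it assumes the span tags are properly closed. It tracks nesting depth so the inner span does not end the match early:

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text inside each outermost <span class="">, nested spans included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <span> tags deep we are inside a match
        self.parts = []     # text pieces of the span currently being collected
        self.results = []   # finished text, one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self.depth:
            # Already inside a matching span: just track the nesting.
            self.depth += 1
        elif dict(attrs).get('class') == '':
            # Outermost span with an empty class attribute: start collecting.
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


z = '<span class=""> a more\ncomplicated<span class="other">other</span>foo</span>'
collector = SpanCollector()
collector.feed(z)
collector.close()
print(collector.results)  # [' a more\ncomplicatedotherfoo']
```

Unlike the regex, this returns the text of the whole outer span, including the text of the nested one.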

Upvotes: 2

Martijn Pieters

Reputation: 1122572

Since HTML is not a regular language, you really should use an HTML or XML parser instead of a regular expression.

Python has several to choose from:

ElementTree is part of the standard library.
BeautifulSoup is a popular third-party library.
lxml is a fast and feature-rich C-based library.
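
As a rough Python 3 illustration of the first option, the nested-span string from the other answer can be parsed with ElementTree. This sketch assumes the input is well-formed XML, which real-world HTML often is not, and the variable names are only for illustration:

```python
import xml.etree.ElementTree as ET

z = '<span class=""> a more\ncomplicated<span class="other">other</span>foo</span>'

# fromstring() raises ParseError if the markup is not well-formed XML.
root = ET.fromstring(z)

# All text inside the outer span, including text of nested elements.
outer_text = ''.join(root.itertext())

# Text of every span (at any depth) whose class attribute is "other".
inner_texts = [el.text for el in root.iter('span') if el.get('class') == 'other']

print(repr(outer_text))  # ' a more\ncomplicatedotherfoo'
print(inner_texts)       # ['other']
```

For messy HTML with unclosed tags, BeautifulSoup or lxml's HTML parser is likely the safer choice.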

Upvotes: 4
