Reputation: 329
I want to find everything between <span class=""> and </span>
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)
For example, in this case: <span class="">foo</span>
I expected it to return foo, but that is not what I get.
Why does my code go wrong?
Cheers
Upvotes: 0
Views: 5000
Reputation: 177800
Your original code works as is. You should use an HTML parser though.
import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print text
Output:
['foo']
Edit
As Tim points out, re.DOTALL should be used, or the example below would fail:
import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print text
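To make the difference concrete, here is a small side-by-side sketch, written for Python 3 rather than the Python 2 used above. The names PATTERN, without_dotall and with_dotall are only illustrative, and the pattern is rewritten as a raw string without the backslash before </span>, which is not needed.

```python
import re

# Same pattern as above, as a raw string; the names here are illustrative.
PATTERN = r'<span class="">(.*?)</span>'
z = '<span class=""> a more\ncomplicated foo</span>'

# Without re.DOTALL, "." stops at the newline, so nothing matches.
without_dotall = re.findall(PATTERN, z, re.IGNORECASE)

# With re.DOTALL, "." also matches the newline inside the span.
with_dotall = re.findall(PATTERN, z, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```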
Even then it would fail for nested spans:
import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print text
Output (failing):
[' a more\ncomplicated<span class="other">other']
So use an HTML parser like BeautifulSoup:
# BeautifulSoup 3 import (Python 2); with BeautifulSoup 4 use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z)
print soup.findAll('span',{'class':''})
print
print soup.findAll('span',{'class':'other'})
Output:
[<span class=""> a more
complicated<span class="other">other</span>foo</span>]
[<span class="other">other</span>]
Upvotes: 2
Reputation: 1122572
Since HTML is not a regular language, you really should use an XML parser instead.
Python has several to choose from:
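One option that ships with Python 3 is html.parser from the standard library. The following is only a rough sketch of how it could be applied to the question: the class name EmptyClassSpanText and its attributes are made up for illustration, and it assumes that what is wanted is the text of each outermost <span class="">, including text inside nested spans.

```python
from html.parser import HTMLParser


class EmptyClassSpanText(HTMLParser):
    """Collect the text inside each outermost <span class="">, nested spans included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of open <span> levels inside a matching span
        self.parts = []     # text pieces of the span currently being read
        self.results = []   # finished text, one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self.depth:
            # A nested span: track it so its closing tag is not mistaken
            # for the end of the outer span.
            self.depth += 1
        elif dict(attrs).get('class') == '':
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
parser = EmptyClassSpanText()
parser.feed(z)
parser.close()
print(parser.results)  # [' a more\ncomplicatedotherfoo']
```

Because the parser counts opening and closing span tags, the nested span no longer cuts the result short the way the non-greedy regex does.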
Upvotes: 4