Reputation: 2002
When I use this regular expression in Python:
import re
pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)
My output is:
Govt has nothing to do with former CAG official RP Singh:
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell">
<span class='al-attribution-source'>Times of India</span></td>
<td class="al-attribution-cell timestamp-cell">
<span class='dash-separator'> - </span>
<span class='al-attribution-timestamp'>‎46 minutes ago‎
The match should have stopped at the first "</span>".
Please suggest what's wrong here.
Upvotes: 0
Views: 220
Reputation: 17092
You could just as easily use BeautifulSoup, which is great for this kind of thing.
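# Using BeautifulSoup 4; install with "pip install beautifulsoup4"
from bs4 import BeautifulSoup
# html is the page source as a string; naming the parser avoids a warning
soup = BeautifulSoup(html, 'html.parser')
# Find the first <span> whose class is "titletext"
result = soup.find('span', 'titletext')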
And then result would hold the <span> with class titletext, as you're looking for.
Upvotes: 0
Reputation: 22858
.* will match </span>, so it keeps on going until the last one.
The best answer is: don't parse HTML with regular expressions. Use the lxml library (or something similar).
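from lxml import html
html_string = '<blah>'  # placeholder for the page source
tree = html.fromstring(html_string)
# Select every <span> whose class attribute is exactly "titletext"
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)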
Using a proper XML/HTML parser will save you massive amounts of time and trouble. If you roll your own parser, you'll have to cater for malformed tags, comments, and myriad other things. Don't reinvent the wheel.
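If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a rough illustration, not a replacement for lxml: the class name TitleCollector and the sample markup are made up for the example, and it assumes the class attribute is exactly "titletext" and that the title span contains no nested spans.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: collects the text of <span class="titletext"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "titletext") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside the title span
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


# Made-up sample, shortened from the output shown in the question
sample = (
    '<span class="titletext">Govt has nothing to do with former CAG official</span>'
    "<span class='al-attribution-source'>Times of India</span>"
)

collector = TitleCollector()
collector.feed(sample)
collector.close()
titles = collector.titles
print(titles)
```

For real pages, lxml or a similar library is still the safer choice, since it copes with broken markup far better than a hand-written handler like this one.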
Upvotes: 1
Reputation: 3898
I would suggest using pyquery instead of going mad on regular expressions. It's based on lxml and makes HTML parsing as easy as using jQuery.
Something like this is everything you need:
from pyquery import PyQuery
doc = PyQuery(html)  # html is the page source as a string
doc('span.titletext').text()
You could also use BeautifulSoup, but the conclusion is always the same: don't use regular expressions for parsing HTML. There are tools out there that make your life easier.
Upvotes: 2
Reputation: 288280
.* is a greedy match of any characters; it is going to consume as many characters as possible. Instead, use the non-greedy version .*?, as in
pathstring = '<span class="titletext">(.*?)</span>'
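To see the difference, here is a small sketch that runs both patterns on the same input. The sample string is made up (shortened from the output in the question) and is kept on a single line.

```python
import re

# Made-up sample, shortened from the output shown in the question
sample = (
    '<span class="titletext">Govt has nothing to do with former CAG official</span>'
    "<span class='al-attribution-source'>Times of India</span>"
)

greedy = re.compile(r'<span class="titletext">(.*)</span>')
lazy = re.compile(r'<span class="titletext">(.*?)</span>')

# Greedy: .* runs to the end of the line, then backtracks to the LAST </span>
greedy_title = greedy.search(sample).group(1)

# Non-greedy: .*? stops at the FIRST </span> that lets the pattern match
lazy_title = lazy.search(sample).group(1)

print(greedy_title)
print(lazy_title)
```

Note that . does not match newlines by default, so a title that spans several lines would also need the re.DOTALL flag.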
Upvotes: 2