Kundan Kumar

Reputation: 2002

Python web scraping

I am using this regular expression in Python:

import re
pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)

My output is:

Govt has nothing to do with former CAG official RP Singh:
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell">
<span class='al-attribution-source'>Times of India</span></td>
<td class="al-attribution-cell timestamp-cell">
<span class='dash-separator'>&nbsp;- </span>
<span class='al-attribution-timestamp'>&lrm;46 minutes ago&lrm;

The match should have stopped at the first "</span>".

Please suggest what's wrong here.

Upvotes: 0

Views: 220

Answers (4)

jdotjdot

Reputation: 17092

You could also just as easily use BeautifulSoup, which is great for this kind of thing.

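# Using BeautifulSoup 4; install with "pip install beautifulsoup4"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html holds the page source; naming the parser avoids a warning
result = soup.find('span', 'titletext')  # first <span> whose class is "titletext"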

result then holds the first <span> with class titletext, which is what you're looking for; result.text gives its text.

Upvotes: 0

Steve Mayne

Reputation: 22858

.* also matches </span>, so the match keeps going until the last </span> it can reach.

The best answer is: don't parse HTML with regular expressions. Use the lxml library (or something similar).

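from lxml import html

html_string = '<blah>'  # placeholder; substitute the page source here
tree = html.fromstring(html_string)
# Select every <span> whose class attribute is exactly "titletext"
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)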

Using a proper XML/HTML parser will save you massive amounts of time and trouble. If you roll your own parser, you'll have to cater for malformed tags, comments, and myriad other things. Don't reinvent the wheel.
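To illustrate the point about comments and nested tags, here is a minimal sketch that uses the standard library's html.parser instead of lxml, so it runs without installing anything. The class name TitleTextParser and the sample markup are made up for illustration; a real page will be messier, and lxml or BeautifulSoup will usually cope with broken markup better than this does.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the text inside every <span class="titletext"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # <span> nesting depth inside a titletext span; 0 means outside
        self._chunks = []  # text pieces of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A nested <span> inside a titletext span: track it so its
            # closing tag is not mistaken for the end of the title.
            self._depth += 1
        elif "titletext" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical sample markup: nested tags, a comment containing </span>,
# and an unrelated span that should be ignored.
sample_html = (
    '<h2><a href="#"><span class="titletext">First <b>headline</b></span></a></h2>'
    "<!-- a comment containing </span> -->"
    "<span class='al-attribution-source'>Times of India</span>"
    '<span class="titletext">Second headline</span>'
)

parser = TitleTextParser()
parser.feed(sample_html)
parser.close()
print(parser.titles)
```

The parser reports the comment through a separate callback, so the </span> inside it never reaches handle_endtag. A regular expression would have to special-case that.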

Upvotes: 1

StefanoP

Reputation: 3898

I would suggest using pyquery instead of going mad over regular expressions. It's based on lxml and makes HTML parsing as easy as using jQuery.

Something like this is all you need:

from pyquery import PyQuery
doc = PyQuery(html)  # html holds the page source
doc('span.titletext').text()

You could also use BeautifulSoup, but the conclusion is the same: don't use regular expressions to parse HTML. There are tools out there that make your life easier.

Upvotes: 2

phihag

Reputation: 288280

.* is a greedy match of any character (except a newline, by default); it consumes as many characters as possible. Instead, use the non-greedy version .*?, as in:

pathstring = '<span class="titletext">(.*?)</span>'
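A small side-by-side sketch of the difference, using only the re module. The sample markup is invented for illustration and is much shorter than the page in the question.

```python
import re

# Hypothetical sample: two titletext spans on one line, so the greedy
# pattern has a later </span> to run on to.
sample_html = (
    '<span class="titletext">First headline</span></a>'
    '<span class="titletext">Second headline</span>'
)

greedy = re.compile(r'<span class="titletext">(.*)</span>')
lazy = re.compile(r'<span class="titletext">(.*?)</span>')

# Greedy: one match that swallows everything up to the last </span>.
greedy_titles = greedy.findall(sample_html)
# Non-greedy: each match stops at the first </span> after its opening tag.
lazy_titles = lazy.findall(sample_html)

print(greedy_titles)
print(lazy_titles)
```

The greedy pattern returns a single string that still contains markup, while the non-greedy one returns the two titles separately.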

Upvotes: 2
