Reputation: 2002

Python web scraping

When I use this regular expression in Python:

import re

pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)

My output is:

Govt has nothing to do with former CAG official RP Singh:
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell">
<span class='al-attribution-source'>Times of India</span></td>
<td class="al-attribution-cell timestamp-cell">
<span class='dash-separator'>&nbsp;- </span>
<span class='al-attribution-timestamp'>&lrm;46 minutes ago&lrm;

The match should have stopped at the first "</span>".

Please suggest what's wrong here.

Upvotes: 0

Answers (4)

jdotjdot

Reputation: 17092

You could just as easily use BeautifulSoup, which is great for this kind of thing.

# Using BeautifulSoup 4; install with "pip install beautifulsoup4"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html holds the page source
result = soup.find('span', 'titletext')    # first <span> with class "titletext"

Then result would hold the <span> with class titletext that you're looking for.

Upvotes: 0

Steve Mayne

Reputation: 22858

.* will match </span> as well, so it keeps going until the last one.

The best answer is: don't parse HTML with regular expressions. Use the lxml library (or something similar).

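from lxml import html

html_string = '<blah>'  # placeholder for the page source
tree = html.fromstring(html_string)
# Select every <span> whose class attribute is exactly "titletext"
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)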

Using a proper XML/HTML parser will save you massive amounts of time and trouble. If you roll your own parser, you'll have to cater for malformed tags, comments, and myriad other things. Don't reinvent the wheel.
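If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module instead. This is a swapped-in technique, not the lxml approach above, and the names TitleTextParser and extract_titles are made up for illustration:

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the text of every <span> whose class list contains "titletext"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a matching span
        self.titles = []  # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A nested span inside a title: track it so the closing tags balance.
            self.depth += 1
        elif "titletext" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.depth:
            self.titles[-1] += data


def extract_titles(markup):
    """Hypothetical helper: return the text of each titletext span in markup."""
    parser = TitleTextParser()
    parser.feed(markup)
    parser.close()
    return parser.titles
```

Calling extract_titles(page_source) returns a list of strings, one per matching span. Unlike lxml or BeautifulSoup, this sketch does nothing to repair badly broken markup, so treat it as a fallback.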

Upvotes: 1

StefanoP

Reputation: 3898

I would suggest using pyquery instead of going mad on regular expressions. It's based on lxml and makes HTML parsing as easy as using jQuery.

Something like this is all you need:

from pyquery import PyQuery

doc = PyQuery(html)
doc('span.titletext').text()

You could also use BeautifulSoup, but the conclusion is always the same: don't use regular expressions for parsing HTML. There are tools out there that make your life easier.

Upvotes: 2

phihag

Reputation: 288280

.* is a greedy match of any characters; it is going to consume as many characters as possible. Instead, use the non-greedy version .*?, as in

pathstring = '<span class="titletext">(.*?)</span>'
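To see the difference side by side, here is a minimal sketch that runs both patterns over the same input. The sample string is made up for illustration and is much shorter than the asker's real page:

```python
import re

# Hypothetical sample markup, made up for illustration.
sample = (
    '<span class="titletext">First headline</span>'
    '<span class="source">Times of India</span>'
)

greedy = re.compile(r'<span class="titletext">(.*)</span>')
lazy = re.compile(r'<span class="titletext">(.*?)</span>')

# Greedy: the group runs on to the LAST </span> in the string.
greedy_match = greedy.search(sample).group(1)
# Non-greedy: the group stops at the FIRST </span>.
lazy_match = lazy.search(sample).group(1)

print(greedy_match)
print(lazy_match)
```

Note that . does not match newlines by default, so a title that spans several lines would also need the re.DOTALL flag.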

Upvotes: 2
