Manas Chaturvedi

Reputation: 5540

Web scraping using urllib2

I am trying to scrape all the titles off of this RSS Feed:

http://www.quora.com/Python-programming-language-1/rss

Here is my code:

import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles =  re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]

However, instead of getting a list of titles as the output, I am getting a bunch of code from the rss source. What am I doing wrong?

Upvotes: 0

Views: 321

Answers (2)

alko

Reputation: 48337

As already mentioned, your regular expression lacks a non-greedy quantifier, and adding one fixes it. But I strongly recommend switching from regular expressions to tools better suited to XML parsing, such as lxml or BeautifulSoup, or to a specialised RSS parsing module such as feedparser.

For example, see how your task can be done with lxml:

>>> import lxml.etree
>>> rss = lxml.etree.fromstring(content)
>>> titles = rss.findall('.//title')
>>> print '\n'.join(title.text for title in titles[:2])
Questions About Python (programming language) on Quora
Could someone explain for me the following Python function that uses @wraps from functools?
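
If lxml is not installed, roughly the same approach should work with the standard library's xml.etree.ElementTree. The sketch below is only an illustration: sample_rss is a made-up feed fragment standing in for the downloaded content, not the real Quora feed.

```python
import xml.etree.ElementTree as ET

# Made-up RSS fragment for illustration; in the question this would be
# the string downloaded from the feed URL.
sample_rss = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First question</title></item>
    <item><title>Second question</title></item>
  </channel>
</rss>"""

# Parse the XML and collect the text of every <title> element,
# in document order.
root = ET.fromstring(sample_rss)
titles = [t.text for t in root.findall('.//title')]

print(titles)
```

The first entry is the channel title and the rest are the item titles, which matches the order of the lxml output above.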

Upvotes: 0

ndpu

Reputation: 22561

You should use the non-greedy modifier (?) in the expression:

#allTitles =  re.compile('<title>(.*)</title>')
allTitles =  re.compile('<title>(.*?)</title>')

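Without the ?, .* is greedy: it matches as much as it can, so the (.*) group captures everything from the first <title> up to the last </title> it can reach (on the same line, since . does not match newlines by default).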
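
To see the difference concretely, here is a small sketch that runs both patterns on a short string. The sample string is invented for illustration and is not taken from the actual feed.

```python
import re

# Invented one-line fragment with two <title> elements.
sample = '<title>First</title><link>x</link><title>Second</title>'

# Greedy: .* runs to the end of the line, then backtracks to the LAST </title>.
greedy = re.findall(r'<title>(.*)</title>', sample)

# Non-greedy: .*? stops at the FIRST </title> after each <title>.
lazy = re.findall(r'<title>(.*?)</title>', sample)

print(greedy)  # ['First</title><link>x</link><title>Second']
print(lazy)    # ['First', 'Second']
```

The greedy pattern returns a single match containing the markup between the two titles, which is the "bunch of code" described in the question.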

Upvotes: 2
