Levi Melvin

Reputation: 11

RSS feed scraping with BeautifulSoup

I'm having trouble with my script. I am able to get the titles and links, but I can't seem to open each article and scrape its content. Can somebody please help?

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source  = urlopen('http://www.marketingmag.com.au/feed/').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<a href="(.*)">')

find_title = re.findall(title, source)
find_link = re.findall(link, source)



literate = []
literate[:] = range(1, 10)

for i in literate:
    print find_title[i]
    print find_link[i]

articlePage = urlopen(find_link[i]).read()

divBegin = articlePage.find('<div class="entry-content">')

article = articlePage[divBegin:(divBegin+1000)]

soup = BeautifulSoup(article)

paragList = soup.findAll('p')

for i in paragList:
    print i
    print ("\n")

Upvotes: 0

Views: 4212

Answers (2)

user945967

Reputation:

Your code strongly reminds me of http://www.youtube.com/watch?v=Ap_DlSrT-iE

Why are you using BeautifulSoup to parse XML? It's built for HTML pages, and Python itself ships with very good XML parsers. Example: http://docs.python.org/library/xml.dom.minidom.html
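As a rough sketch of what that could look like, here is one way to pull the title and link of every item out of a feed with xml.dom.minidom. It is written for Python 3 (the question uses Python 2), and SAMPLE_FEED, child_text and parse_items are made-up names for illustration; the sample feed is a stand-in for the real one, whose exact structure I'm only assuming follows plain RSS 2.0.

```python
from xml.dom import minidom

# Hypothetical feed snippet standing in for the downloaded RSS document.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First article</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def child_text(node, tag):
    """Return the text of the first direct <tag> child of node, or ''."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag:
            # Join plain text and CDATA parts; feeds often wrap text in CDATA.
            return "".join(
                part.data
                for part in child.childNodes
                if part.nodeType in (part.TEXT_NODE, part.CDATA_SECTION_NODE)
            ).strip()
    return ""


def parse_items(feed_xml):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    dom = minidom.parseString(feed_xml)
    return [
        (child_text(item, "title"), child_text(item, "link"))
        for item in dom.getElementsByTagName("item")
    ]


for title, link in parse_items(SAMPLE_FEED):
    print(title, link)
```

For the real feed you would pass the bytes returned by urlopen(...).read() to parse_items instead of the sample string. Only the item elements are walked, so the channel's own title and link are not mixed in with the articles.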

Upvotes: 0

Senthil Kumaran

Reputation: 56901

Do not use regex to parse HTML. Just use Beautiful Soup and its facilities like find_all (findAll in BeautifulSoup 3, which your import uses) to get the links; then use urllib2.urlopen to open each URL and read the contents.
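A minimal sketch of that approach, assuming Python 3 and the bs4 package (BeautifulSoup 4) rather than the Python 2 / BeautifulSoup 3 setup in the question. The helper names extract_links, extract_paragraphs and fetch are made up for illustration, and the sample HTML is a stand-in; the only detail taken from the question is the entry-content class on the article's div.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_links(html):
    """Return the href of every <a> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def extract_paragraphs(html, container_class="entry-content"):
    """Return the text of each <p> inside the article container div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        # The page has no such div, so there is nothing to scrape.
        return []
    return [p.get_text(" ", strip=True) for p in container.find_all("p")]


def fetch(url):
    """Download a page and return its raw bytes (not called in the demo)."""
    with urlopen(url) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical article page used instead of a live download.
    sample_page = (
        '<div class="entry-content">'
        "<p>First paragraph.</p><p>Second paragraph.</p>"
        "</div><p>Footer text outside the article.</p>"
    )
    for paragraph in extract_paragraphs(sample_page):
        print(paragraph)
```

To scrape every article, you would call extract_paragraphs(fetch(link)) inside the loop over your links, once per link. Parsing the whole page and then searching inside the container also avoids cutting the HTML at a fixed 1000 characters, which can slice a tag in half.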

Upvotes: 2
