Reputation: 11
I'm having trouble with my script. I am able to get the titles and links, but I can't seem to open each article and scrape its content. Can somebody please help?
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
source = urlopen('http://www.marketingmag.com.au/feed/').read()
title = re.compile('<title>(.*)</title>')
link = re.compile('<a href="(.*)">')
find_title = re.findall(title, source)
find_link = re.findall(link, source)
literate = []
literate[:] = range(1, 10)
for i in literate:
    print find_title[i]
    print find_link[i]
    articlePage = urlopen(find_link[i]).read()
    divBegin = articlePage.find('<div class="entry-content">')
    article = articlePage[divBegin:(divBegin+1000)]
    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')
    for i in paragList:
        print i
        print ("\n")
Upvotes: 0
Views: 4212
Reputation:
Your code strongly reminds me of: http://www.youtube.com/watch?v=Ap_DlSrT-iE
Why do you actually use BeautifulSoup for XML parsing? It's built for HTML sites, and Python itself has very good XML parsers. Example: http://docs.python.org/library/xml.dom.minidom.html
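To illustrate the minidom suggestion, here is a rough sketch of pulling titles and links out of an RSS feed. It assumes the feed uses the usual RSS 2.0 layout (item elements with title and link children), which I have not checked against the feed in the question. The names parse_feed, _text_of and SAMPLE_FEED are made up for this example, and it is written for Python 3.

```python
from xml.dom import minidom


def _text_of(node, tag):
    """Return the text of the first <tag> element under node, or ''."""
    matches = node.getElementsByTagName(tag)
    if not matches:
        return ""
    wanted = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    # Join plain text and CDATA children; feeds often wrap titles in CDATA.
    parts = [c.data for c in matches[0].childNodes if c.nodeType in wanted]
    return "".join(parts).strip()


def parse_feed(xml_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    dom = minidom.parseString(xml_text)
    return [
        (_text_of(item, "title"), _text_of(item, "link"))
        for item in dom.getElementsByTagName("item")
    ]


# Hypothetical feed content, only here to show the shape of the result.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title><link>http://example.com/first</link></item>
<item><title>Second post</title><link>http://example.com/second</link></item>
</channel></rss>"""

for title, link in parse_feed(SAMPLE_FEED):
    print(title, link)
```

Iterating over the item elements keeps the channel's own title out of the results and avoids the greedy (.*) regex matching across several tags on one line.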
Upvotes: 0
Reputation: 56901
Do not use regex to parse HTML. Just use Beautiful Soup and its facilities like find_all to get the links; then you can use urllib2.urlopen to open each URL and read the contents.
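Something along these lines, as a minimal sketch rather than a drop-in fix. It assumes Python 3 with the bs4 package installed, so urllib.request.urlopen stands in for urllib2.urlopen. The entry-content class is taken from the question; extract_links, extract_paragraphs, fetch and SAMPLE_PAGE are names invented for this example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def extract_paragraphs(html, container_class="entry-content"):
    """Return the text of each <p> inside the first matching <div>."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        return []
    return [p.get_text().strip() for p in container.find_all("p")]


def fetch(url):
    """Download a page and return its raw bytes (not called below)."""
    with urlopen(url) as response:
        return response.read()


# Hypothetical page, only here to show what the helpers return.
SAMPLE_PAGE = (
    '<html><body><a href="http://example.com/a">A</a><a name="x">no href</a>'
    '<div class="entry-content"><p>First paragraph.</p>'
    "<p>Second <b>one</b>.</p></div><p>Footer</p></body></html>"
)

print(extract_links(SAMPLE_PAGE))
print(extract_paragraphs(SAMPLE_PAGE))
```

Parsing the whole page and then searching inside the div avoids cutting the article off at a fixed 1000 characters, which is what the slice in the question does.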
Upvotes: 2