Reputation: 23
I'm trying to get all the links to the articles (which have the class 'title may-blank' to denote them). I'm trying to figure out why the code below prints a whole bunch of "href=" when I run it instead of the actual URLs. I also get a bunch of random text and links after the 25 failed article URLs (all 'href='), but I'm not sure why that happens, since it should stop once it stops finding the class 'title may-blank'. Can you guys help me find out what's wrong?
import urllib2

def get_page(page):
    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote]  # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'
print_all_links(get_page(reddit_url))
Upvotes: 0
Views: 107
Reputation: 31524
Rawing is correct, but when I face an XY problem I prefer to provide the best way to accomplish X instead of a way to fix Y. You should use an HTML parser like BeautifulSoup to parse webpages:
from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])
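If you are on Python 3, where urllib2 no longer exists, roughly the same idea looks like the sketch below. The explicit 'html.parser' argument and the custom User-Agent header are my own assumptions rather than part of the answer above; reddit may reject requests that use the default Python User-Agent.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def print_all_links(page):
    # reddit may reject the default Python User-Agent, so send a custom one
    # (the header value here is arbitrary).
    req = Request(page, headers={'User-Agent': 'link-printer/0.1'})
    html = urlopen(req).read()
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])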
If you are really allergic to HTML parsers, at least use a regex (even though you should stick to HTML parsing):
import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
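As a quick sanity check of that pattern, here it is run against a made-up anchor tag (the HTML below is invented for illustration, not taken from reddit):

import re

sample = '<a class="title may-blank " href="http://example.com/story">Story</a>'

# The capture group grabs everything between the href quotes.
print(re.findall(r'<a class="title may-blank " href="(.*?)"', sample))
# ['http://example.com/story']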
Upvotes: 1
Reputation: 43286
That's because the line

start_quote = page.find('"', start_link + 4)

doesn't do what you think it does. start_link is set to the index where "title may-blank" starts, so a page.find starting at start_link + 4 actually begins searching at "e may-blank". The first '"' it finds is therefore the closing quote of the class attribute, and the slice between that quote and the next one is the text " href=" rather than the URL, which is exactly the output you're seeing. If you change

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it'll work.
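To see the arithmetic, here is a tiny illustration on a made-up anchor tag (note it assumes the class attribute is written exactly 'title may-blank', with no trailing space):

page = '<a class="title may-blank" href="http://example.com/article">x</a>'

start_link = page.find('title may-blank')     # 10, the 't' of 'title'

# Original offset: start_link + 4 lands on the 'e' of 'title', so the first '"'
# found is the one that closes the class attribute.
start_quote = page.find('"', start_link + 4)  # 25, closing quote of class="..."
end_quote = page.find('"', start_quote + 1)   # 32, opening quote of href="..."
print(repr(page[start_quote + 1:end_quote]))  # prints ' href=' (the garbage output)

# Corrected offset: skip past the whole class value before looking for a quote.
start_quote = page.find('"', start_link + len('title may-blank') + 1)  # 32
end_quote = page.find('"', start_quote + 1)   # 59, closing quote of href="..."
print(page[start_quote + 1:end_quote])        # prints http://example.com/article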
Upvotes: 0