Phillipe Dongwoo Han

Reputation: 23

Generating URLs in Python?

I'm trying to collect the links to all the articles, which are marked with the class 'title may-blank'. When I run the code below, it prints a series of "href=" strings instead of the actual URLs. After those 25 failed article URLs (all "href="), it also prints a lot of unrelated text and links, and I'm not sure why, since it should stop once it no longer finds the class 'title may-blank'. Can you help me find out what's wrong?

import urllib2

def get_page(page):

    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find ('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote] # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'

print_all_links(get_page(reddit_url))

Upvotes: 0

Views: 107

Answers (2)

enrico.bacis

Reputation: 31524

Rawing is correct, but when I face an XY problem I prefer to show the best way to accomplish X rather than a way to fix Y. You should use an HTML parser such as BeautifulSoup to parse web pages:

from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])
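
For readers who cannot install a third-party package, the same parser-based idea can be sketched with the standard library's html.parser module under Python 3. This is only an illustration, not part of the original answer: TitleLinkParser, extract_links and the sample string are hypothetical names and data, and the sample stands in for the downloaded page (in Python 3 the download would go through urllib.request.urlopen rather than urllib2).

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # Splitting on whitespace makes a trailing space or extra classes harmless.
        classes = (attr_map.get('class') or '').split()
        href = attr_map.get('href')
        if href and 'title' in classes and 'may-blank' in classes:
            self.links.append(href)


def extract_links(html):
    """Return the hrefs of all matching links in an HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used in place of a real page.
sample = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="other" href="http://example.com/skip">skip</a>'
    '<a class="title may-blank outbound" href="http://example.com/b">B</a>'
)
print(extract_links(sample))
```

Because the class attribute is split into individual names, the match does not depend on the exact spacing inside the attribute value.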

If you are really allergic to HTML parsers, at least use a regex (though you really should stick to HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
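
As a small variation, here is a sketch of a slightly more tolerant pattern, run against a hypothetical sample string rather than a live page. LINK_RE, find_links and the sample are assumptions made for illustration. The pattern still assumes that class comes directly before href inside the tag, which is exactly the kind of fragility an HTML parser avoids. Under Python 3 a downloaded page arrives as bytes and would need decoding before a str pattern can be applied to it.

```python
import re

# [^"]* lets the class value carry a trailing space or extra class names.
LINK_RE = re.compile(r'<a\s+class="title may-blank[^"]*"\s+href="([^"]*)"')


def find_links(html):
    """Return the href of every anchor whose class starts with the marker."""
    return LINK_RE.findall(html)


# Hypothetical sample markup, used in place of a real page.
sample = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="other" href="http://example.com/skip">skip</a>'
    '<a class="title may-blank outbound" href="http://example.com/b">B</a>'
)
print(find_links(sample))
```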

Upvotes: 1

Aran-Fey

Reputation: 43286

That's because the line

start_quote = page.find('"', start_link + 4)

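doesn't do what you think it does. start_link is set to the index of "title may-blank". So, if you call page.find starting at start_link + 4, the search actually begins at "e may-blank". The first quote it finds is the one that closes the class attribute, and the next quote is the one that opens the href value, so the slice between them is just " href=". If you change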

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it'll work.
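
A fixed offset like this depends on the class value ending right after "may-blank"; any extra character before the closing quote (a trailing space, for example) shifts the quotes again. A less position-sensitive sketch, assuming the href attribute follows the class attribute in the same tag, searches for href=" directly. It also returns None when the marker is missing: page.find returns -1 in that case, and the original code keeps searching from near the start of the remaining text, which would explain the unrelated text printed after the article links. The marker parameter, collect_links and the sample string are hypothetical, added only for illustration.

```python
def get_next_target(page, marker='title may-blank'):
    """Return (url, end_index) for the next marked link, or (None, -1)."""
    start_link = page.find(marker)
    if start_link == -1:
        return None, -1              # no more article links on the page
    href_key = 'href="'
    start_href = page.find(href_key, start_link)
    if start_href == -1:
        return None, -1
    start_url = start_href + len(href_key)
    end_quote = page.find('"', start_url)
    if end_quote == -1:
        return None, -1
    return page[start_url:end_quote], end_quote


def collect_links(page):
    """Walk the page and gather every marked URL until none are left."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]         # continue after the URL just found
    return links


# Hypothetical sample markup, used in place of a real page.
sample = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<p>"unrelated quoted text"</p>'
    '<a class="title may-blank" href="http://example.com/b">B</a>'
    '<a class="other" href="http://example.com/skip">skip</a>'
)
print(collect_links(sample))
```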

Upvotes: 0
