pouzzler

Reputation: 1834

urlparse completely failing on every url

The following code never returns a non-empty netloc or scheme from urlparse; the scheme and netloc end up prepended to the path component instead. What am I doing wrong, please?

#! /usr/bin/python
# -*- coding: UTF-8 -*-

from urllib import urlopen  
from urlparse import urlparse, urljoin 
import re   
link_exp = re.compile("href=(.+?)(?:'|\")", re.UNICODE)  

flux = urlopen("http://www.w3.org") 
links = [urlparse(x) for x in link_exp.findall(flux.read())]
for x in links : 
    print x

This extracts every URL (or maybe my regex is wrong) and prints it, except that 'http://' always ends up in the path rather than in the scheme. How come? And I should probably reimplement the urlparse functionality once I'm done solving this, since this is a course exercise, not a real-world scenario. Sorry for not being clearer on this!
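Edit: here is a minimal illustration of the symptom, using a hard-coded sample string (shown with Python 3's urllib.parse; the Python 2 urlparse module behaves the same way):

```python
from urllib.parse import urlparse  # Python 2: `from urlparse import urlparse`

# If the extracted "URL" starts with a stray quote, urlparse finds no valid
# scheme prefix, so the entire string lands in .path:
result = urlparse("'http://www.w3.org/'")
print(result.scheme)  # ''
print(result.netloc)  # ''
print(result.path)    # "'http://www.w3.org/'"
```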

Upvotes: 2

Views: 405

Answers (2)

ATOzTOA

Reputation: 35980

Use this (the opening quote is matched before the capture group, so it is no longer included in the URL):

link_exp = re.compile(r"href=\"(.+?)(?:'|\")", re.UNICODE)  

Output:

...
ParseResult(scheme='http', netloc='ev.buaa.edu.cn', path='/', params='', query='', fragment='')
...

Upvotes: 0

Katriel

Reputation: 123722

Your regex is wrong:

x = "<a href='http://www.bbcnews.com'>foo</a>"
link_exp.findall(x)
# ["'http://www.bbcnews.com"]

Note that you're including the opening quote.

Upvotes: 2
