inetphantom

Reputation: 2594

Python 3: urlparse fault?

I have the following code (gdaten[n][2] gives a URL; n is the index):

    try:
        p=urlparse(gdaten[n][2])
        while p.scheme == "javascript" or p.scheme == "mailto":
            p=urlparse(gdaten[n][2])
            print(p," was skipped (", gdaten[n][2],")")
            n += 1
        print ("check:", gdaten[n][2])
        f = urllib.request.urlopen(gdaten[n][2])
        htmlcode = str(f.read())
        parser = MyHTMLParser(strict=False)
        parser.feed(htmlcode)

    except urllib.error.URLError:
        pass  # do some stuff
    except IndexError:
        pass  # do some stuff
    except ValueError:
        pass  # do some stuff

Now I have the following error:

urllib.error.URLError: <urlopen error unknown url type: javascript>

in line 8. How is that possible? I thought the while loop skips all links with the scheme javascript. Why does the except not work? Where is my mistake? MyHTMLParser appends the links found on the website to gdaten like this: [[stuff, stuff, link], [stuff, stuff, link]]

Upvotes: 1

Views: 473

Answers (1)

Kyle Kelley

Reputation: 14144

This is an off-by-one error.

In other words, n and p are out of sync.

To fix this, increment n inside the loop before reassigning p, so that p always describes gdaten[n][2].

Why wasn't this working?

Assuming n is set to zero at the start (could start at 42, it doesn't matter), let's say gdaten is laid out like so:

gdaten[0][2] = "javascript://blah.js"
gdaten[1][2] = "http://hello.com"
gdaten[2][2] = "javascript://moo.js"

Upon checking the first while condition, p.scheme is 'javascript' so we enter the loop. p gets set to urlparse("javascript://blah.js") again and n is increased to 1. Since we're checking urlparse("javascript://blah.js") again, we continue into the loop again.

p now gets set to urlparse("http://hello.com") and n gets set to 2.

Since urlparse("http://hello.com") passes the check, the while loop ends.

Meanwhile, since n is now 2, the URL that gets opened is gdaten[2][2], which is "javascript://moo.js". That entry was never checked by the loop.
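The walkthrough above can be reproduced with a small script. This is only a sketch: it reuses the three sample URLs from this answer, fills the first two fields of each entry with placeholder strings, and replaces the real urlopen call with a print so nothing is fetched.

```python
from urllib.parse import urlparse

# Sample data from the walkthrough; "stuff" fields are placeholders.
gdaten = [
    ["stuff", "stuff", "javascript://blah.js"],
    ["stuff", "stuff", "http://hello.com"],
    ["stuff", "stuff", "javascript://moo.js"],
]

n = 0
p = urlparse(gdaten[n][2])
# Same loop order as in the question: parse first, increment afterwards.
while p.scheme == "javascript" or p.scheme == "mailto":
    p = urlparse(gdaten[n][2])  # parses the entry at the current n
    n += 1                      # n now points one entry past what p describes
    print("p.scheme =", repr(p.scheme), "| n =", n)

# p describes gdaten[1], but n already points at gdaten[2].
print("would open:", gdaten[n][2])
```

The last line shows which entry would be handed to urlopen: the loop's check and the index it leaves behind refer to different entries.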

Code fix

try:
    p=urlparse(gdaten[n][2])
    while p.scheme in ("javascript", "mailto") or not p.scheme:
        print(p," was skipped (", gdaten[n][2],")")

        # Skipping to the next value
        n += 1
        p=urlparse(gdaten[n][2])

    print ("check:", gdaten[n][2])
    f = urllib.request.urlopen(gdaten[n][2])
    htmlcode = str(f.read())

...
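One way to keep the index and the check from drifting apart is to put the skipping into a small helper that only ever inspects the entry it is about to return. The following is a sketch, not the asker's code: next_fetchable_index and SKIPPED_SCHEMES are hypothetical names, and the sample data (including the mailto address and the relative path) is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical set of schemes to skip; "" covers URLs that have no scheme.
SKIPPED_SCHEMES = {"javascript", "mailto", ""}


def next_fetchable_index(gdaten, start):
    """Return the first index >= start whose URL scheme is not skipped, or None."""
    for i in range(start, len(gdaten)):
        # The entry that is checked is the same entry whose index is returned.
        if urlparse(gdaten[i][2]).scheme not in SKIPPED_SCHEMES:
            return i
    return None  # ran off the end without raising IndexError


# Made-up sample data in the [stuff, stuff, link] layout from the question.
gdaten = [
    ["stuff", "stuff", "javascript://blah.js"],
    ["stuff", "stuff", "mailto:someone@example.com"],
    ["stuff", "stuff", "/relative/path"],
    ["stuff", "stuff", "http://hello.com"],
    ["stuff", "stuff", "javascript://moo.js"],
]

n = next_fetchable_index(gdaten, 0)
if n is not None:
    print("check:", gdaten[n][2])
```

Returning None at the end of the list is a design choice for this sketch; the original code relies on the except IndexError branch instead.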

Upvotes: 3
