Reputation: 2594
I have the following code (gdaten[n][2] gives a URL, n is the index):
try:
    p=urlparse(gdaten[n][2])
    while p.scheme == "javascript" or p.scheme == "mailto":
        p=urlparse(gdaten[n][2])
        print(p," was skipped (", gdaten[n][2],")")
        n += 1
    print ("check:", gdaten[n][2])
    f = urllib.request.urlopen(gdaten[n][2])
    htmlcode = str(f.read())
    parser = MyHTMLParser(strict=False)
    parser.feed(htmlcode)
except urllib.error.URLError:
    #do some stuff
except IndexError:
    #do some stuff
except ValueError:
    #do some stuff
Now I have the following error:
urllib.error.URLError: <urlopen error unknown url type: javascript>
in line 8. How is that possible? I thought the while loop skips all links with the scheme javascript. Why does the except not catch it? Where is my mistake?
MyHTMLParser appends the links found on the website to gdaten like this: [[stuff, stuff, link], [stuff, stuff, link]]
Upvotes: 1
Views: 473
Reputation: 14144
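In other words, n and p are out of sync: inside the loop, p is parsed from gdaten[n][2] before n is incremented, so once the loop has run, p always describes the entry one position behind n.
To fix this, add one to n before setting p.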
Assuming n is set to zero at the start (it could start at 42, it doesn't matter), let's say gdaten is laid out like so:
gdaten[0][2] = "javascript://blah.js"
gdaten[1][2] = "http://hello.com"
gdaten[2][2] = "javascript://moo.js"
Upon checking the first while condition, p.scheme is 'javascript', so we enter the loop. p gets set to urlparse("javascript://blah.js") again and n is increased to 1. Since the condition is still looking at urlparse("javascript://blah.js"), we go around the loop a second time.
p now gets set to urlparse("http://hello.com") and n gets set to 2.
Since urlparse("http://hello.com") passes the check, the while loop ends.
Meanwhile, since n is 2, the URL that gets opened is gdaten[2][2], which is "javascript://moo.js".
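The walkthrough above can be reproduced without any network access. This is only a minimal sketch: the helper name buggy_skip and the "stuff" placeholders are made up for illustration, and the loop body is copied from the question with the print removed.

```python
from urllib.parse import urlparse

# Hypothetical sample data, laid out like the example above.
gdaten = [
    ["stuff", "stuff", "javascript://blah.js"],
    ["stuff", "stuff", "http://hello.com"],
    ["stuff", "stuff", "javascript://moo.js"],
]


def buggy_skip(gdaten, n):
    """Run the loop from the question: p is parsed before n is advanced."""
    p = urlparse(gdaten[n][2])
    while p.scheme == "javascript" or p.scheme == "mailto":
        p = urlparse(gdaten[n][2])
        n += 1
    return n, p


n, p = buggy_skip(gdaten, 0)
# p was parsed from index 1, but n now points at index 2.
print("p.scheme:", p.scheme)
print("n:", n, "-> URL that would be opened:", gdaten[n][2])
```

The loop exits because p looks fine, yet the URL at index n was never checked, which is why urlopen still receives a javascript link.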
try:
    p = urlparse(gdaten[n][2])
    # Skip javascript: and mailto: links, and URLs with no scheme at all
    while p.scheme == "javascript" or p.scheme == "mailto" or not p.scheme:
        print(p, " was skipped (", gdaten[n][2], ")")
        # Skipping to the next value
        n += 1
        p = urlparse(gdaten[n][2])
    print("check:", gdaten[n][2])
    f = urllib.request.urlopen(gdaten[n][2])
    htmlcode = str(f.read())
    ...
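One way to make this kind of desync harder to reintroduce is to let a single function own both the index and the check. The sketch below is an assumption about how that could look, not part of the original code: next_fetchable_index and SKIPPED_SCHEMES are hypothetical names, and the sample list is invented.

```python
from urllib.parse import urlparse

# Schemes that urlopen cannot fetch (assumed set, taken from the question).
SKIPPED_SCHEMES = {"javascript", "mailto"}


def next_fetchable_index(gdaten, start):
    """Return the first index >= start whose URL has a fetchable scheme.

    Returns None when no such entry exists, instead of raising IndexError.
    """
    for i in range(start, len(gdaten)):
        scheme = urlparse(gdaten[i][2]).scheme
        if scheme and scheme not in SKIPPED_SCHEMES:
            return i
    return None


# Hypothetical sample data in the [stuff, stuff, link] layout.
gdaten = [
    ["stuff", "stuff", "javascript://blah.js"],
    ["stuff", "stuff", "mailto:someone@example.com"],
    ["stuff", "stuff", "/relative/path"],
    ["stuff", "stuff", "http://hello.com"],
]

idx = next_fetchable_index(gdaten, 0)
print("next URL to open:", None if idx is None else gdaten[idx][2])
```

Because the index that is tested is the same index that is returned, the URL passed to urlopen is always the one that was checked.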
Upvotes: 3