Reputation: 356
I'm currently going through the python challenge, and i'm up to level 4, see here I have only been learning python for a few months, and i'm trying to learn python 3 over 2.x so far so good, except when i use this bit of code, here's the python 2.x version:
import urllib, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
text = urllib.urlopen(prefix + nothing).read()
print text
match = findnothing(text)
if match:
nothing = match.group(1)
print " going to", nothing
else:
break
So to convert this to 3, I would change to this:
import urllib.request, urllib.parse, urllib.error, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
text = urllib.request.urlopen(prefix + nothing).read()
print(text)
match = findnothing(text)
if match:
nothing = match.group(1)
print(" going to", nothing)
else:
break
So if i run the 2.x version it works fine, goes through the loop, scraping the url and goes to the end, i get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If i run the 3.x version, i get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
So if i change the r to a b in this line
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
_bk201
Upvotes: 3
Views: 1228
Reputation: 8007
Instead of urllib
we're using requests
and it has two options ( which maybe you can search in urllib for similar options )
Response object
import requests
>>> response = requests.get('https://api.github.com')
Using response.content
- has the bytes
type
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'
While using response.text
- you have the encoded response
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'
The default encoding is utf-8
, but you can set it right after the request like so
import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
And then response.text
will hold the content in the encoding you requested ...
Upvotes: 0
Reputation: 414865
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode bytes returned by urlopen().read()
and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode() #note: utf-8
The page doesn't specify the preferable character encoding via Content-Type
header or <meta>
element. I don't know what the default encoding should be for text/html
but the rfc 2068 says:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
Upvotes: 4
Reputation: 9763
Regular expressions make sense only on text, not on binary data.
So, keep findnothing = re.compile(r"nothing is (\d+)").search
, and convert text
to string instead.
Upvotes: 1