bk201

Reputation: 356

Python 2to3 not working

I'm currently going through the Python Challenge and I'm up to level 4. I have only been learning Python for a few months, and I'm trying to learn Python 3 rather than 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:

import urllib, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print "   going to", nothing
    else:
        break

To convert this to Python 3, I changed it to this:

import urllib.request, urllib.parse, urllib.error, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print("   going to", nothing)
    else:
        break

If I run the 2.x version it works fine: it goes through the loop, scraping each URL until it reaches the end, and I get the following output:

and the next nothing is 72198
   going to 72198
and the next nothing is 80992
   going to 80992
and the next nothing is 8880
   going to 8880 etc

If I run the 3.x version, I get the following output:

b'and the next nothing is 44827'
Traceback (most recent call last):
  File "C:\Python32\lvl4.py", line 26, in <module>
    match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object

If I change the r prefix to b in this line:

findnothing = re.compile(b"nothing is (\d+)").search

I get:

b'and the next nothing is 44827'
   going to b'44827'
Traceback (most recent call last):
  File "C:\Python32\lvl4.py", line 24, in <module>
    text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly

Any ideas?

I'm pretty new to programming, so please don't bite my head off.

_bk201

Upvotes: 3

Views: 1228

Answers (3)

Ricky Levi

Reputation: 8007

Instead of urllib we use requests, which gives you two ways to read the response body (urllib may have similar options you can look for).

Response object

>>> import requests
>>> response = requests.get('https://api.github.com')

response.content holds the raw body as bytes:

>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'

response.text gives you the decoded body as a str:

>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'

requests guesses the encoding from the HTTP headers, but you can override it right after the request like so:

>>> import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'

After that, response.text returns the content decoded with the encoding you set.
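For urllib, a rough equivalent is to read the charset from the response headers yourself. The sketch below is only an illustration: decode_body is a hypothetical helper name, and the iso-8859-1 fallback is an assumption, not something the page guarantees. With urllib you would call it as decode_body(response.read(), response.headers); here the headers object is built by hand so the example runs without a network connection.

```python
from email.message import Message


def decode_body(raw, headers, fallback="iso-8859-1"):
    """Decode raw bytes using the charset named in Content-Type.

    `fallback` is an assumed default used when the header names no charset.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset)


# Hand-built headers standing in for urllib's response.headers,
# which is a subclass of email.message.Message.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

body = decode_body("caf\u00e9".encode("utf-8"), headers)
print(type(body).__name__)  # the result is a str, not bytes
```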

Upvotes: 0

jfs

Reputation: 414865

You can't mix bytes and str objects implicitly.

The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:

text = urllib.request.urlopen(prefix + nothing).read().decode()  # decode() defaults to UTF-8

The page doesn't specify a preferred character encoding via the Content-Type header or a <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
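Putting this together, the Python 3 version of the script from the question might look like the sketch below. The UTF-8 default is an assumption (the pages shown in the question are plain ASCII, so either UTF-8 or ISO-8859-1 decodes them the same way). The fetch parameter and the max_steps cap are hypothetical additions that are not in the original script; they let the loop be tried offline and stop it from running forever.

```python
import re
import urllib.request

PREFIX = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
find_nothing = re.compile(r"nothing is (\d+)").search  # str pattern, matches str text


def fetch_text(nothing, encoding="utf-8"):
    """Download one page and decode the bytes to str (encoding is an assumption)."""
    with urllib.request.urlopen(PREFIX + nothing) as response:
        return response.read().decode(encoding)


def follow_chain(start, fetch=fetch_text, max_steps=1000):
    """Follow the 'nothing' numbers until a page has none; return that page's text."""
    nothing = start
    text = ""
    for _ in range(max_steps):  # hypothetical safety cap instead of `while True`
        text = fetch(nothing)
        print(text)
        match = find_nothing(text)
        if not match:
            break
        nothing = match.group(1)  # already a str, so PREFIX + nothing works
        print("   going to", nothing)
    return text
```

Calling follow_chain("12345") starts the walk from the same number as the original script. Because text is a str everywhere, both the regex and the URL concatenation work without further conversion.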

Upvotes: 4

Valentin Lorentz

Reputation: 9763

Regular expressions make sense only on text, not on binary data. So keep findnothing = re.compile(r"nothing is (\d+)").search, and decode text from bytes to a str before matching instead.
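A small offline illustration of that point, using a hard-coded bytes value in place of what urlopen(...).read() returns. The sample value comes from the question's output, and UTF-8 is an assumed encoding.

```python
import re

find_nothing = re.compile(r"nothing is (\d+)").search

raw = b"and the next nothing is 44827"  # stand-in for urlopen(...).read()

error = None
try:
    find_nothing(raw)  # str pattern applied to bytes
except TypeError as exc:
    error = str(exc)  # same kind of TypeError as in the question

text = raw.decode("utf-8")  # bytes -> str (encoding is an assumption)
nothing = find_nothing(text).group(1)
print(repr(nothing))  # a plain str, safe to append to the URL prefix
```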

Upvotes: 1
