Vedran Šego

Reputation: 3765

Python Challenge level 17 in Python 3

I recently started playing with The Python Challenge. While fairly convoluted, the required coding isn't very hard, which makes learning many useful modules quite interesting.

My question is about level 17. I understand the idea of following the clues as was needed in level 4, while collecting the cookies, which is what I did. However, I cannot BZ2-decompress the string that I get.

I tried Googling, and I found a nice blog with the solutions in Python 2. Specifically, the one for level 17 is here. Analysing that one, I realized that I do get the compressed string (from the cookies) right, and it decompresses properly in Python 2:

bz2.decompress(urllib.unquote_plus(compressed))

However, bz2.decompress in Python 3 requires a byte array instead of a string, but the obvious Python 3 counterpart of the above line:

bz2.decompress(urllib.parse.unquote_plus(message).encode("utf8"))

fails with OSError: Invalid data stream. I tried all the standard encodings and some variants of the above, but to no avail.

Here is my (non-working) solution so far:

#!/usr/bin/env python3

"""
The Python Challenge #17: http://www.pythonchallenge.com/pc/return/romance.html

This is similar to #4 and it actually uses its solution. However, the key is in
the cookies. The page's cookie says: "you+should+have+followed+busynothing..."

So, we follow the chain from #4, using the word "busynothing" and
reading the cookies.
"""

import urllib.request, urllib.parse
import re
import bz2

nothing = "12345"
last_cookie = None
message = ""
while True:
    headers = dict()
    if last_cookie:
        headers["Cookie"] = last_cookie
    r = urllib.request.Request("http://www.pythonchallenge.com/pc/def/linkedlist.php?busynothing=" + nothing, headers=headers)
    with urllib.request.urlopen(r) as u:
        last_cookie = u.getheader("Set-Cookie")
        m = re.match(r"info=(.*?);", last_cookie)
        if m:
            message += m.group(1)
        text = u.read().decode("utf8")
        print("{} >>> {}".format(nothing, text))
        m = re.search(r"\d+$", text)
        try:
            nothing = str(int(m.group(0)))
        except Exception as e:
            print(e)
            break

print("Cookies message:", message)
print("Decoded:", bz2.decompress(urllib.parse.unquote_plus(message).encode("utf8")))

So, my question is: what would a Python 3 solution to the above problem look like and why doesn't mine work as expected?

I am well aware that parts of this can be done more nicely. I was going for a quick and dirty solution, so my interest here is only that it works (and why not the way I did it above).

Upvotes: 1

Views: 1232

Answers (1)

Martijn Pieters

Reputation: 1122252

You need to use the urllib.parse.unquote_to_bytes() function here. It does not support the + to space mapping, but that is trivially worked around with str.replace():

urllib.parse.unquote_to_bytes(message.replace('+', '%20'))

This then decompresses nicely. You can decode the resulting decompressed bytes as ASCII:

print("Decoded:", bz2.decompress(urllib.parse.unquote_to_bytes(message.replace('+', '%20'))).decode('ascii'))

Demo using a different message I prepared to not give away the puzzle:

>>> import bz2
>>> import urllib.parse
>>> another_message = 'BZh91AY%26SY%80%F4C%E8%00%00%02%13%80%40%00%04%00%22%E3%8C%00+%00%22%004%D0%40%D04%0C%B7%3B%E6h%B1AIM%3D%5E.%E4%8Ap%A1%21%01%E8%87%D0'
>>> bz2.decompress(urllib.parse.unquote_to_bytes(another_message.replace('+', '%20'))).decode('ascii')
'This is not the message'
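
If you need this in more than one place, the replace-then-unquote step can be wrapped in a small helper. This is only a sketch: the name unquote_plus_to_bytes is made up for illustration (it is not part of urllib.parse), and the payload is generated locally with bz2.compress() and quote_plus() rather than taken from the puzzle:

```python
import bz2
import urllib.parse


def unquote_plus_to_bytes(quoted: str) -> bytes:
    """Hypothetical helper: like unquote_plus(), but returns raw bytes.

    A literal '+' in the original data is quoted as %2B, so every '+'
    left in the quoted string stands for a space and can be rewritten
    as %20 before unquoting.
    """
    return urllib.parse.unquote_to_bytes(quoted.replace('+', '%20'))


# Build a stand-in payload: compress a message, then URL-quote the bytes.
payload = bz2.compress(b"This is not the message")
quoted = urllib.parse.quote_plus(payload)

# Undo the quoting without ever going through a str decode step.
restored = unquote_plus_to_bytes(quoted)
text = bz2.decompress(restored).decode('ascii')
print(text)
```

The point of the helper is that the data stays bytes all the way from the quoted string to bz2.decompress(), so no text encoding can corrupt it.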

Alternatively, tell urllib.parse.unquote_plus() to use the Latin-1 encoding instead of UTF-8. The default error handler for unquote_plus() is 'replace', so you never notice that the original data can't be decoded as UTF-8; the offending bytes are silently replaced with the U+FFFD REPLACEMENT CHARACTER, which is what causes decompression to fail. Latin-1 maps all bytes one-to-one to the first 256 Unicode code points, so you can encode back to the original bytes:

>>> '\ufffd' in urllib.parse.unquote_plus(another_message)
True
>>> bz2.decompress(urllib.parse.unquote_plus(another_message, 'latin1').encode('latin1')).decode('ascii')
'This is not the message'
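
To see the difference between the two round trips side by side, here is a short sketch that reuses the demo message from above. The variable names (original, lossy, lossless and so on) are only illustrative:

```python
import bz2
import urllib.parse

another_message = (
    'BZh91AY%26SY%80%F4C%E8%00%00%02%13%80%40%00%04%00%22%E3%8C%00+%00%22'
    '%004%D0%40%D04%0C%B7%3B%E6h%B1AIM%3D%5E.%E4%8Ap%A1%21%01%E8%87%D0'
)

# Reference: the raw bytes, obtained without any str decoding step.
original = urllib.parse.unquote_to_bytes(another_message.replace('+', '%20'))

# Default round trip: decode as UTF-8 with errors='replace', encode as UTF-8.
lossy = urllib.parse.unquote_plus(another_message)
lossy_bytes = lossy.encode('utf8')

# Latin-1 round trip: every byte maps to exactly one code point and back.
lossless = urllib.parse.unquote_plus(another_message, encoding='latin1')
lossless_bytes = lossless.encode('latin1')

print("UTF-8 round trip preserved the bytes:  ", lossy_bytes == original)
print("Latin-1 round trip preserved the bytes:", lossless_bytes == original)
print("Replacement characters inserted:", lossy.count('\ufffd'))
```

Each U+FFFD encodes back to three bytes in UTF-8, so the UTF-8 round trip changes both the content and the length of the data, and bz2 no longer sees a valid stream.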

Upvotes: 3
