shadow

Reputation: 311

Fix filename encoding when unzipping files with special characters in Python

There are many questions about encoding out there, but I still have not been able to solve my problem.

Imagine I have three files within a compressed ZIP file:

Übersicht.pdf
finalePräsentation.pdf
münchen.pdf

I want to unzip those files so I do:

import zipfile

with zipfile.ZipFile("path/result.zip", "r") as zip_ref:
    zip_ref.extractall("/path/")

The filenames look like crap:

[Screenshot: the extracted files with garbled names]

My research shows that filenames are basically byte strings and that the OS cannot tell which encoding was used. Still, I was wondering whether there is any way to repair the file names so that the German umlauts are displayed correctly.

I tried to change the encoding like this:

    with zipfile.ZipFile(save_as, "r") as zip_ref:
        print(zip_ref.namelist())
        encoded_strings = [s.encode("utf-8") for s in zip_ref.namelist()]
        print(encoded_strings)
        zip_ref.extractall(dest)

I tried this with latin-1, iso and some other encodings and the byte-strings are in fact interpreted differently, but always cryptic. Thus I am asking the question to see if there is a simple way to fix this.

Thanks very much in advance, help is very much appreciated

EDIT: The output of locale gives me the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

A hexdump of the beginning of the first file reads like this:

0000000 25 50 44 46 2d 31 2e 34 0a 25 93 8c 8b 9e 20 52
0000010 65 70 6f 72 74 4c 61 62 20 47 65 6e 65 72 61 74
0000020 65 64 20 50 44 46 20 64 6f 63 75 6d 65 6e 74 20
0000030 68 74 74 70 3a 2f 2f 77 77 77 2e 72 65 70 6f 72

echo *.pdf | xxd | head gives me this:

00000000: 6669 6e61 6c65 5072 c3a4 7365 6e74 6174  finalePr..sentat
00000010: 696f 6e2e 7064 660a                      ion.pdf.
00000000: 6dc3 bc6e 6368 656e 2e70 6466 0a         m..nchen.pdf.
00000000: c39c 6265 7273 6963 6874 2e70 6466 0a    ..bersicht.pdf.

Upvotes: 3

Views: 5365

Answers (2)

tripleee

Reputation: 189638

Thanks for the hex dump. With the updated data, it seems the file names are completely run-of-the-mill mojibake: UTF-8 bytes that were decoded with the wrong codec, probably code page 1252.

destination_file = filename.encode('cp1252').decode('utf-8')
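
A minimal sketch of how that one-liner could be applied while extracting, instead of calling extractall. The helper names fix_mojibake and extract_with_fixed_names are hypothetical, and the choice of cp1252 is only the guess from above. Python's zipfile decodes member names as cp437 when the archive does not set its UTF-8 flag, so passing wrong_codec="cp437" is worth trying if cp1252 does not give sensible names. Note that, unlike extractall, this sketch does not sanitize member paths, so it should only be used on archives you trust.

```python
import os
import shutil
import zipfile


def fix_mojibake(name, wrong_codec="cp1252"):
    """Undo a UTF-8 name that was decoded with the wrong codec (assumed: cp1252)."""
    try:
        return name.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the name was probably not mojibake; keep it.
        return name


def extract_with_fixed_names(zip_path, dest, wrong_codec="cp1252"):
    """Extract each member of zip_path into dest under its repaired name."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            target = os.path.join(dest, fix_mojibake(info.filename, wrong_codec))
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            # Create parent directories for members stored in subfolders.
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with zip_ref.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

Usage would be something like extract_with_fixed_names("path/result.zip", "/path/"). On Python 3.11 or later, zipfile.ZipFile also accepts a metadata_encoding argument in read mode, which may make the manual repair unnecessary.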

My original speculation from before you updated your question is preserved below as possibly interesting and / or instructive.


Your screen shots are a bit muddy, but it looks vaguely like the file names are encoded as Windows code page 437.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', "Übersicht.pdf").encode('utf-8')
b'U\xcc\x88bersicht.pdf'

Examining character code 0xcc, it translates to the glyph ╠ (U+2560) in the encodings cp1125, cp437, cp720, cp737, cp775, cp850, cp852, cp855, cp856, cp857, cp858, cp860, cp861, cp862, cp863, cp865, cp866, and cp869; and 0x88 translates to ê (U+00EA) in cp437, cp720, cp850, cp857, cp858, cp860, cp861, cp863, and cp865. There are multiple encodings in the intersection, but 437 was by far the most common back in the days when PKzip was invented.

(╠ is double-stroked, whereas your screen shot looks more like a single-stroked version, but this might be just a matter of font design and/or an unclear picture; and the conclusion is compelling enough that I'm going with this.)

(Disclosure: the links are to a page of my own.)
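
The intersection of codecs described above can be checked with Python's own codec tables. This is only a sketch: the candidate list is copied from the paragraph above plus two controls (cp1252 and latin-1), the helper name codecs_matching is made up for illustration, and codecs missing from a given Python build are silently skipped.

```python
# Candidate code pages named in the text, plus two controls for comparison.
CANDIDATES = [
    "cp1125", "cp437", "cp720", "cp737", "cp775", "cp850", "cp852",
    "cp855", "cp856", "cp857", "cp858", "cp860", "cp861", "cp862",
    "cp863", "cp865", "cp866", "cp869", "cp1252", "latin-1",
]


def codecs_matching(byte, glyph, candidates=CANDIDATES):
    """Return the codecs in which the single byte decodes to the given glyph."""
    matches = []
    for name in candidates:
        try:
            if byte.decode(name) == glyph:
                matches.append(name)
        except (LookupError, UnicodeDecodeError):
            # Codec unavailable, or the byte is undefined in it; skip.
            continue
    return matches


box_drawing = set(codecs_matching(b"\xcc", "\u2560"))   # 0xcc -> ╠
e_circumflex = set(codecs_matching(b"\x88", "\u00ea"))  # 0x88 -> ê
print(sorted(box_drawing & e_circumflex))
```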

Assuming this analysis is correct, and assuming the zip library gives you the names as byte strings, you should be able to simply decode them with

destination_file = filename.encode('latin-1').decode('cp437')

The detour through Latin-1 maps each character code to the byte with the same value (recall that Latin-1 coincides with Unicode in the first 256 code points, but is a pure 8-bit character encoding), so we can then map those bytes back to Unicode by decoding them with the correct codec.
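
To make the Latin-1 detour concrete, here is a small sketch. It assumes the name arrives as a string in which every code point stands for one raw byte; the variable names are purely illustrative. The last two lines show the opposite direction, which may be what is needed if the library has already decoded the name as cp437: re-encode with cp437, decode as UTF-8, then normalize to NFC.

```python
import unicodedata

# Assumed input: one code point per raw byte of the stored name.
raw_name = "U\xcc\x88bersicht.pdf"

# Latin-1 maps code points 0-255 to the identical byte values.
as_bytes = raw_name.encode("latin-1")

# Reinterpreting those bytes as code page 437 gives the garbled rendering.
as_cp437 = as_bytes.decode("cp437")
print(as_cp437)  # U╠êbersicht.pdf

# Reverse direction: back to bytes via cp437, then decode as UTF-8.
recovered = as_cp437.encode("cp437").decode("utf-8")
# The result is in decomposed form (U + combining diaeresis); compose it.
print(unicodedata.normalize("NFC", recovered))  # Übersicht.pdf
```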

Upvotes: 2

Hugo-C

Reputation: 11

If you can't find the original encoding, you can always try to fall back to ASCII with:

[unicodedata.normalize('NFKD', s).encode('ascii', 'ignore') for s in zip_ref.namelist()]

using the built-in lib unicodedata
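
A slightly expanded sketch of the same idea, returning strings rather than bytes so the results can be used directly as file names. The helper name ascii_fallback is hypothetical. Keep in mind that characters without a decomposition, such as ß, are dropped entirely, and that this only gives readable results when the names were decoded correctly in the first place.

```python
import unicodedata


def ascii_fallback(name):
    """Strip accents by decomposing, then dropping every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fallback("Übersicht.pdf"))  # Ubersicht.pdf
print(ascii_fallback("münchen.pdf"))    # munchen.pdf
print(ascii_fallback("Straße"))         # Strae (ß has no ASCII decomposition)
```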

Upvotes: 1
