mr_noob

Reputation: 19

Strange Characters in filename

I'm trying to copy attachments from one Confluence page to another in Python 3.9 via the REST API. While doing that, I found a .docx file that has some strange characters in its filename. Download link to the file

The filename is as follows: Template_Anfrage Eingangsbestätigung.docx

If I delete the character 'ä', I get this: Template_Anfrage Eingangsbestatigung.docx

I would expect this: Template_Anfrage Eingangsbesttigung.docx

Can you tell me what caused this problem? And if you could tell me how to convert these characters to normal UTF-8 characters, that would be awesome.

Sorry for my bad English, and sorry if this is a stupid question. I'm an absolute beginner and I didn't find a solution on the web because I don't really know what to search for.

Upvotes: 0

Views: 326

Answers (1)

JosefZ

Reputation: 30113

The 'ä' on mac (ä) is different to the 'ä' on windows (ä)

Your issue does not stem from an OS difference (Mac versus Windows); rather, it is about Unicode normalization. See the following script and its output:

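import unicodedata

def printref(phase, strings, origins):
    """Print each string and each of its code points with length, mojibake, category and name."""
    linetemplate = '{0:<10} {1:<4} {2:4} {3:4} {4:4} {5}'
    print('')
    # Header line: the normalization phase and whether both strings compare equal
    print(chr(0x20)*10, phase.ljust(9, chr(0x20)), strings[0] == strings[1])
    for ii, chars in enumerate(strings):
        # Summary row for the whole string
        print(linetemplate.format(origins[ii], len(chars), chars,
                chars.encode('utf-8').decode('cp1252'),       # mojibake
                '', ''))
        # One row per code point
        for char in chars:
            print(linetemplate.format('', len(char), char,
                  char.encode('utf-8').decode('cp1252'),      # mojibake
                  unicodedata.category(char),
                  unicodedata.name(char, '???')))

# Written with escapes because both forms look identical on screen:
# 'a' + U+0308 COMBINING DIAERESIS (filename) versus precomposed U+00E4 (question)
strings = ['a\u0308', '\u00e4']
origins = ['filename', 'question']
printref('original', strings, origins)
for form in ['NFKC', 'NFKD']:
    printref(form, [unicodedata.normalize(form, x) for x in strings], origins)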

Output: .\SO\68919847.py

           original  False
filename   2    ä   ä
           1    a    a    Ll   LATIN SMALL LETTER A
           1    ̈    ̈   Mn   COMBINING DIAERESIS
question   1    ä    ä
           1    ä    ä   Ll   LATIN SMALL LETTER A WITH DIAERESIS

           NFKC      True
filename   1    ä    ä
           1    ä    ä   Ll   LATIN SMALL LETTER A WITH DIAERESIS
question   1    ä    ä
           1    ä    ä   Ll   LATIN SMALL LETTER A WITH DIAERESIS

           NFKD      True
filename   2    ä   ä
           1    a    a    Ll   LATIN SMALL LETTER A
           1    ̈    ̈   Mn   COMBINING DIAERESIS
question   2    ä   ä
           1    a    a    Ll   LATIN SMALL LETTER A
           1    ̈    ̈   Mn   COMBINING DIAERESIS

Unfortunately, my browser renders both forms of ä (decomposed and precomposed) in the same way; the following picture shows the difference better:

script output
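
To convert such a filename to the precomposed form, something along these lines should work. This is only a minimal sketch: the helper name normalize_filename is made up for illustration, and it uses the canonical form NFC instead of the NFKC shown in the script above, because NFKC additionally rewrites compatibility characters (ligatures, for example), which is usually not wanted for filenames.

```python
import unicodedata

def normalize_filename(name: str, form: str = "NFC") -> str:
    """Hypothetical helper: return `name` in the given Unicode normalization form."""
    return unicodedata.normalize(form, name)

# Decomposed spelling as found in the attachment's filename:
# 'a' followed by U+0308 COMBINING DIAERESIS
decomposed = "Template_Anfrage Eingangsbesta\u0308tigung.docx"

# NFC composes 'a' + U+0308 into the single code point U+00E4
composed = normalize_filename(decomposed)

# The decomposed string is exactly one code point longer
print(len(decomposed) - len(composed))
print(composed == "Template_Anfrage Eingangsbest\u00e4tigung.docx")
```

Normalizing the name once, right after reading it from the REST API response and before comparing or uploading it, should be enough. Deleting the 'ä' from the composed string then removes the whole letter rather than only the diaeresis.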

Upvotes: 1
