cph_sto

Reputation: 7597

UTF-8 issue - Is there a way to convert strange looking characters like Ã¤ to their proper German characters like ä in Python?

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, Ã¤ appears instead of ä, Ãœ instead of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and in the Python DataFrame as such, instead of the proper characters ä, ö, ü, ß.
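For reference, the Python import looks roughly like this (the filename and separator here are made up for illustration; only the encoding argument turns out to matter):

import pandas as pd

# Hypothetical filename and separator; dtype=str keeps the columns as strings,
# and encoding= is the parameter that matters for the umlauts.
df = pd.read_csv('my_file.txt', sep=';', dtype=str, encoding='utf-8')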

One workaround to solve this issue is:

  1. Open the file in standard Notepad.
  2. Choose 'Save As'; a dialog window appears.
  3. In the encoding drop-down, change the encoding to UTF-8.

Now, when you import the file in SAS or Python, everything is imported correctly.

But sometimes the .txt files I have are very big (several GB), so I cannot open them in Notepad and apply this hack.

I could use the .replace() function to replace these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of, which is why I want to avoid that approach.

Is there any Python library which can automatically translate these strange characters into their proper characters, e.g. Ã¤ gets translated to ä, and so on?

Upvotes: 2

Views: 1013

Answers (2)

tripleee

Reputation: 189648

If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.

with open(filename, 'r', encoding='utf-8') as f:
    ...  # do things with f

If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is assuming text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1 when reading, and probably make sure you save it in the correct encoding as soon as you have read it.

# Read the file as Latin-1 and write a UTF-8 copy, line by line.
with open(filename, 'r', encoding='latin-1') as inp, \
     open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)

The ftfy library claims to be able to identify and correct a number of common mojibake problems.
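A minimal sketch of what that could look like (assuming ftfy is installed; the sample string mirrors the Ã¤ example from the question):

import ftfy

# fix_text() detects common mojibake patterns and repairs them.
print(ftfy.fix_text('GefÃ¤hrlich'))  # expected to print 'Gefährlich'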

Upvotes: 0

S.C.A

Reputation: 85

Did you try using the codecs library?

import codecs

your_file = codecs.open('your_file.extension', 'r', 'encoding_type')
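For the file from the question, that might look roughly like this (the filename is a placeholder and the encoding shown is an assumption; use whichever encoding actually matches the file):

import codecs

# Placeholder filename; try 'utf-8' or 'cp1252' depending on how the file was saved.
with codecs.open('your_file.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.rstrip())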

Upvotes: 2
