cph_sto

Reputation: 7597

UTF-8 issue - Is there a way to convert strange looking characters like Ã¤ to their proper German characters like ä in Python?

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, Ã¤ appears instead of ä, Ãœ instead of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and in the Python DataFrame as such, instead of the proper characters ä, ö, ü, ß.
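For reference, the Python import looks roughly like this (the filename and separator here are made up for illustration; only the encoding argument turns out to matter):

import pandas as pd

# Hypothetical filename and separator; dtype=str keeps the columns as strings,
# and encoding= is the parameter that matters for the umlauts.
df = pd.read_csv('my_file.txt', sep=';', dtype=str, encoding='utf-8')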

One workaround to solve this issue is:

  1. Open the file in standard Notepad.
  2. Choose 'Save As'; a dialog window appears.
  3. In the encoding drop-down, change the encoding to UTF-8.

Now, when you import the file in SAS or Python, everything is imported correctly.

But sometimes the .txt files I have are very big (several GB), so I cannot open them in Notepad and apply this hack.

I could use the .replace() function to replace these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of, which is why I want to avoid that approach.

Is there any Python library which can automatically translate these strange characters into their proper characters, e.g. Ã¤ gets translated to ä, and so on?

Upvotes: 2

Views: 1013

Answers (2)

tripleee

Reputation: 189648

If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.

with open(filename, 'r', encoding='utf-8') as f:
    ...  # do things with f

If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is assuming text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1 when reading, and probably make sure you save it in the correct encoding as soon as you have read it.

# Read the file as Latin-1 and write a UTF-8 copy, line by line.
with open(filename, 'r', encoding='latin-1') as inp, \
     open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)

The ftfy library claims to be able to identify and correct a number of common mojibake problems.
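A minimal sketch of what that could look like (assuming ftfy is installed; the sample string mirrors the Ã¤ example from the question):

import ftfy

# fix_text() detects common mojibake patterns and repairs them.
print(ftfy.fix_text('GefÃ¤hrlich'))  # expected to print 'Gefährlich'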

Upvotes: 0

S.C.A

Reputation: 85

Did you try using the codecs library?

import codecs

your_file = codecs.open('your_file.extension', 'r', 'encoding_type')
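For the file from the question, that might look roughly like this (the filename is a placeholder and the encoding shown is an assumption; use whichever encoding actually matches the file):

import codecs

# Placeholder filename; try 'utf-8' or 'cp1252' depending on how the file was saved.
with codecs.open('your_file.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.rstrip())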

Upvotes: 2
