NoIdeaHowToFixThis

Reputation: 4564

Corrupted Hebrew: saved as ANSI - convert back to UTF-8

I suspect some data was saved (on Windows machines) as ANSI. As a result, the original Hebrew characters were lost, and what we see is text like ùéôåãé äòéø.

Is the information lost, or is it possible to map the characters back, knowing that the original text was Hebrew?

Upvotes: 2

Views: 1017

Answers (2)

 Bruno

Reputation: 21

I had a very similar problem where the text looked equally corrupted. An online decoder told me that, for some reason, the text had been interpreted as iso-8859-1 instead of iso-8859-8. This fixed it:

text.encode("iso-8859-1").decode("iso-8859-8")
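As a rough check of this approach against the sample string from the question, here is a minimal sketch. The helper name `repair_mojibake` and its default codecs are assumptions for illustration only; cp1252 and cp1255 are used as the Windows counterparts of iso-8859-1 and iso-8859-8, and the real codec pair depends on how the data was actually damaged.

```python
def repair_mojibake(text, wrong="cp1252", right="cp1255"):
    """Undo a wrong decode (hypothetical helper, not a library function)."""
    # Encoding with the wrongly applied codec recovers the raw bytes,
    # provided every character in `text` exists in that codec.
    raw = text.encode(wrong)
    # Decoding those bytes with the intended codec gives the Hebrew text.
    return raw.decode(right)


sample = "ùéôåãé äòéø"  # garbled string quoted in the question
fixed = repair_mojibake(sample)
```

With these assumptions, `fixed` should come out as שיפודי העיר. If the text contains characters that the wrong codec cannot represent, `encode` raises `UnicodeEncodeError`, which is a hint that a different codec pair is involved.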

Upvotes: 1

Karol S

Reputation: 9402

The information is probably not lost, or at most partially lost. If you want to use Python:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open("input.txt", "r", "windows-1255") as sourceFile:
    with codecs.open("output.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

Stolen and adapted from How to convert a file to utf-8 in Python?
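In Python 3 the built-in `open` accepts an `encoding` argument, so the same loop can be written without `codecs`. This is only a sketch: `convert_file` is a hypothetical helper name, and windows-1255 as the source encoding is the same assumption as above.

```python
BLOCKSIZE = 1048576  # characters per read; any reasonable size works


def convert_file(src, dst, src_encoding="windows-1255", dst_encoding="utf-8"):
    """Re-encode a text file chunk by chunk (hypothetical helper)."""
    # newline="" leaves line endings exactly as they are in the source file.
    with open(src, "r", encoding=src_encoding, newline="") as source_file, \
         open(dst, "w", encoding=dst_encoding, newline="") as target_file:
        while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
                break
            target_file.write(contents)
```

Calling `convert_file("input.txt", "output.txt")` would then mirror the snippet above. A byte that is undefined in windows-1255 raises `UnicodeDecodeError`, which usually means the source encoding was guessed wrong.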

You can also use an external tool, like iconv:

iconv -f windows-1255 -t utf-8 input.txt > output.txt

Iconv is available in most Linux distributions, in Cygwin, and on other platforms.

If the file got double-mangled, you may need to do something like this:

iconv -f utf-8 -t windows-1252 input.txt > tmp.txt
iconv -f windows-1255 -t utf-8 tmp.txt > output.txt

but the chances that this kind of stuff happened are minuscule.
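For completeness, the two iconv steps can be sketched in Python as well. `fix_double_mangled` is a hypothetical helper name, and the codec choice (cp1252 for the wrong step, cp1255 for the Hebrew step) simply follows the iconv commands above.

```python
def fix_double_mangled(data):
    """Python counterpart of the two iconv steps (hypothetical helper).

    `data` is the raw content of the file, as bytes.
    """
    # The file is valid UTF-8, but it encodes the wrong (Latin) characters.
    garbled = data.decode("utf-8")
    # Step 1: map those characters back to the original single-byte data.
    original = garbled.encode("cp1252")
    # Step 2: read the single-byte data as Hebrew.
    return original.decode("cp1255")
```

If step 1 raises `UnicodeEncodeError`, the file probably was not mangled this way, or a different intermediate code page was involved.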

Upvotes: 1
