Reputation: 4564
I suspect some data has been saved (on Windows machines) as ANSI. As a result, the original Hebrew characters got lost and what we see instead is something like
ùéôåãé äòéø.
Is the information lost, or is there a way to map the characters back, knowing that the original text was Hebrew?
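To illustrate what I think happened (a minimal sketch of my own, assuming the "ANSI" code page was windows-1255 and the bytes were later read back as windows-1252):

# Hebrew text written with the Hebrew "ANSI" code page (windows-1255)...
sample = "שיפודי העיר"   # an example string, not necessarily the real data
raw_bytes = sample.encode("windows-1255")

# ...but read back with a Western-European code page turns into mojibake.
print(raw_bytes.decode("windows-1252"))   # prints: ùéôåãé äòéø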
Upvotes: 2
Views: 1017
Reputation: 21
I had a very similar problem, where the text looked equally corrupted. An online decoder told me that for some reason the text had been interpreted as iso-8859-1 instead of iso-8859-8, so reversing that fixed it:
text.encode("iso-8859-1").decode("iso-8859-8")
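Applied to the string from the question (a sketch; I'm assuming Python 3 and that the underlying bytes are intact):

mangled = "ùéôåãé äòéø"

# encode() recovers the original bytes (every character here exists in
# iso-8859-1), and decode() then interprets those bytes as Hebrew.
recovered = mangled.encode("iso-8859-1").decode("iso-8859-8")
print(recovered)   # prints: שיפודי העיר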
Upvotes: 1
Reputation: 9402
The information is probably not lost, or at most partially lost. If you want to use Python:
import codecs

BLOCKSIZE = 1048576  # or some other desired size, in bytes

# Read the file as windows-1255 and write it back out as UTF-8, one block at a time.
with codecs.open("input.txt", "r", "windows-1255") as sourceFile:
    with codecs.open("output.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
Stolen and adapted from How to convert a file to utf-8 in Python?
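On Python 3 you can get the same effect without the codecs module, using the built-in open() with an encoding argument (a sketch using the same hypothetical file names):

# Same conversion with Python 3's built-in open().
with open("input.txt", "r", encoding="windows-1255") as sourceFile:
    with open("output.txt", "w", encoding="utf-8") as targetFile:
        for line in sourceFile:
            targetFile.write(line)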
You can also use an external tool, like iconv:
iconv -f windows-1255 -t utf-8 input.txt > output.txt
iconv is available in most Linux distributions, in Cygwin, and on other platforms.
If the file got double-mangled (for example, the windows-1255 bytes were first misread as windows-1252 and then saved as UTF-8), you may need to do something like this:
iconv -f utf-8 -t windows-1252 input.txt > tmp.txt
iconv -f windows-1255 -t utf-8 tmp.txt > output.txt
but the chances that this kind of double mangling happened are minuscule.
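In Python, that double demangling would look roughly like this (a sketch, assuming the file is currently valid UTF-8 and the intermediate code page really was windows-1252):

# Read the UTF-8 text, go back to the windows-1252 bytes it was built from,
# then decode those bytes with the Hebrew code page.
with open("input.txt", "r", encoding="utf-8") as sourceFile:
    mangled = sourceFile.read()

recovered = mangled.encode("windows-1252").decode("windows-1255")

with open("output.txt", "w", encoding="utf-8") as targetFile:
    targetFile.write(recovered)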
Upvotes: 1