E.Praneeth

Reputation: 264

How to convert encoding of text file (which contains text of language other than English) from "UTF-16 LE" to "UTF-8" in Python?

I have a few text files in a folder that contain text in Hindi. Those files are in UTF-16 LE encoding. I want to change the encoding to UTF-8 without changing the text in them. How can I do that?

I wrote two Python scripts, but neither works properly. When I run either of them, it changes the encoding but also clears the file content. This is the code in my Python files:

File 1:

import os
for root, dirs, files in os.walk("."):  
    for filename in files:
        #print(filename[-4:])
        if(filename[-3:] == "txt"):
            f= open(filename,"w+")
            x = f.read()
            print(x)
            f.close()
            f1= open(filename, "w+", encoding="utf-8")
            f1.write(x)
            f1.close()

File 2:

import codecs
BLOCKSIZE = 1048576
with codecs.open("ee.txt", "r", "utf-16-le") as sourceFile:
    with codecs.open("ee.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            print(contents)
            if not contents:
                break
            targetFile.write(contents)

Upvotes: 1

Views: 3326

Answers (2)

jsbueno

Reputation: 110301

You are not specifying that the files are in UTF-16 LE when reading the contents. On top of that, the code tries to read from and write to the same file at the same time, which won't work.

Also, unless you are running this code on a server where someone might attack it by sending an inordinately big text file, you should not worry about file size; just read the whole file at once. (To give you an idea: the Bible, which is a big book, is on the order of 3 MB in size with an 8-bit encoding, and even small VPS servers will have on the order of 200 MB of memory available to your program. That is, you could convert a book the size of 30+ Bibles at once.) Typical desktop computers have several times this amount of memory.

Also, the relatively recent "pathlib" Python library can ease iterating through all your text files, and its Path.read_text and Path.write_text methods will open a file, read or write the contents in the correct encoding, and close it in a single expression. Since, with this method, the reading is already finished by the time the file is written, we can simply do it with two calls:

import pathlib
for filepath in pathlib.Path(".").glob("**/*.txt"):
   data = filepath.read_text(encoding="utf-16 LE")
   filepath.write_text(data, encoding="utf-8")

If you prefer to be on the safe side, in the very, very unlikely event of a catastrophic computer crash in the middle of a file conversion, you could write to a differently named file and swap it in for the original afterwards, so the code looks like this:

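import pathlib
for filepath in pathlib.Path(".").glob("**/*.txt"):
   data = filepath.read_text(encoding="utf-16 LE")
   # Write the converted text to a temporary file next to the original.
   tmp_path = filepath.with_name(filepath.name + ".tmp")
   tmp_path.write_text(data, encoding="utf-8")
   # Swap it in only after the write succeeded. Using the full path (not just
   # filepath.name) keeps files found in subfolders where they are.
   tmp_path.replace(filepath)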
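
One caveat worth checking, offered as an assumption about your files rather than something stated in the question: editors that label a file "UTF-16 LE" often write a byte order mark (BOM) at the start. Decoding with "utf-16-le" keeps that mark as a U+FEFF character in the text, while the generic "utf-16" codec detects and consumes it. A small sketch with made-up sample bytes:

```python
import codecs

# Hypothetical file content: a UTF-16 LE BOM followed by a short Hindi word.
sample = "नमस्ते"
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

# "utf-16-le" treats the BOM as ordinary data, so U+FEFF stays in the text.
kept = raw.decode("utf-16-le")

# "utf-16" reads the BOM to pick the byte order and drops it from the result.
stripped = raw.decode("utf-16")

print(repr(kept[0]))
print(stripped == sample)
```

If your files turn out to start with a BOM, reading them with encoding="utf-16" may be the more convenient choice; if they do not, "utf-16-le" is the safe one.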

Upvotes: 2

Giacomo Catenazzi

Reputation: 9523

Before explaining what is wrong, two useful tips:

I think you should remove the print calls. They will just confuse you, because the encoding used for printing depends on the operating system and environment.

Try with a very short file (a few characters) and check the input and output of both scripts, both as text and as bytes.
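
To make the "as bytes" check concrete, here is a minimal sketch. The sample word is my own pick, not from the question; any short Hindi string would do. It shows what the same text looks like in each encoding:

```python
# Illustrative sample text (an assumption, not taken from the question).
text = "नमस्ते"

utf16_bytes = text.encode("utf-16-le")
utf8_bytes = text.encode("utf-8")

# In UTF-16 LE each of these characters takes 2 bytes; in UTF-8, 3 bytes.
print(len(text), len(utf16_bytes), len(utf8_bytes))
print(utf16_bytes[:2].hex(), utf8_bytes[:3].hex())

# Decoding either byte string with the matching codec gives the same text.
same_text = utf16_bytes.decode("utf-16-le") == utf8_bytes.decode("utf-8")
```

For a real file, opening it in binary mode ("rb") and looking at the first few bytes before and after the conversion gives the same kind of check.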

Now the solution:

In the first example: you should open the file in read mode (r) the first time; w+ truncates it before anything is read.

In the second example: you open the file for reading, but before you read anything you open the same file for writing. That truncates the file, so there are no characters left to read.

Write to a ee.txt.tmp file instead and, at the end, if there were no errors, move the tmp file over the original, removing the .tmp suffix.

In general: never read from and write to the same file at the same time.
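
The truncation can be seen directly. This is a small sketch on a throwaway file; the file name and sample text are made up for the demonstration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created only for this demonstration.
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-16-le") as f:
        f.write("नमस्ते")
    size_before = os.path.getsize(path)

    with open(path, "r", encoding="utf-16-le") as source:
        # Opening the same path in "w" mode empties the file immediately.
        with open(path, "w", encoding="utf-8") as target:
            size_after_open = os.path.getsize(path)
            # Nothing is left to read, so this is an empty string.
            contents = source.read()

print(size_before, size_after_open, repr(contents))
```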
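
A minimal sketch of that temporary-file approach, assuming the source really is UTF-16 LE. The helper name convert_file and the demo file are my own, not something from the question:

```python
import os
import tempfile


def convert_file(path, source_encoding="utf-16-le", target_encoding="utf-8"):
    """Re-encode one text file via a temporary file (hypothetical helper)."""
    tmp_path = path + ".tmp"
    # newline="" leaves the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as source:
        data = source.read()
    with open(tmp_path, "w", encoding=target_encoding, newline="") as target:
        target.write(data)
    # Only reached if reading and writing raised no error.
    os.replace(tmp_path, path)


# Demo on a throwaway file so no real data is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ee.txt")
    with open(demo_path, "wb") as f:
        f.write("नमस्ते\r\n".encode("utf-16-le"))
    convert_file(demo_path)
    with open(demo_path, "rb") as f:
        result = f.read()
    tmp_left_behind = os.path.exists(demo_path + ".tmp")
```

os.replace overwrites the destination in one step, so the original file is never deleted before the converted copy is ready to take its place.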

Upvotes: 0
