Reputation: 1
I am a new starter in Python and, in general, in coding, so any help is greatly appreciated.
I have more than 3,000 text files in a single directory, in multiple encodings, and I need to convert them to a single encoding (e.g. UTF-8) for further NLP work. When I checked the file types in the shell, I identified the following encodings:
Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
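For reference, a survey like the one above can be produced with the `file` utility. This sketch builds a tiny demo directory first so it is self-contained; the directory and file names are made up:

```shell
# Create two demo files: one pure ASCII, one with a lone 0xE9 byte
# (ISO-8859-style "é", which is not valid UTF-8).
mkdir -p corpus
printf 'plain ascii text\n' > corpus/a.txt
printf 'caf\351\n' > corpus/b.txt
# -b suppresses the file name; the pipeline counts each distinct type.
file -b corpus/*.txt | sort | uniq -c | sort -rn
```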
Any ideas on how to convert text files with the above-mentioned encodings into UTF-8?
Upvotes: 0
Views: 1357
Reputation: 1
I've adapted the script from @Jessica20119 to my needs, making some adjustments while converting the files.
import os
import codecs
import zipfile

import chardet

zip_path = 'YourZipFile'
temp_dir = 'temp_dir'

# Ensure temp_dir exists
os.makedirs(temp_dir, exist_ok=True)

def detect_encoding(file_path, chunk_size=1024):
    """Detect a file's encoding incrementally, feeding 1 KB chunks to chardet."""
    detector = chardet.UniversalDetector()
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding']

with zipfile.ZipFile(zip_path, 'r') as zip_file:
    for file_name in zip_file.namelist():
        if file_name.endswith('/'):  # Skip directory entries
            continue
        full_path = os.path.join(temp_dir, file_name)
        # If the file was already extracted and converted, skip it.
        # chardet reports names like 'utf-8' or 'UTF-8-SIG', so compare case-insensitively.
        if os.path.exists(full_path):
            current_encoding = detect_encoding(full_path)
            if current_encoding and current_encoding.lower() == 'utf-8':
                print(f"{file_name} is already UTF-8 encoded and will not be processed again.")
                continue
        # Extract the file as it needs processing
        zip_file.extract(file_name, temp_dir)
        # Process the file for encoding conversion; detection may return None for binary data
        detected_encoding = detect_encoding(full_path)
        if detected_encoding and detected_encoding.lower() != 'utf-8':
            try:
                with codecs.open(full_path, "r", encoding=detected_encoding) as source_file:
                    contents = source_file.read()
                with codecs.open(full_path, "w", encoding='utf-8') as target_file:
                    target_file.write(contents)
                print(f"Converted {full_path} to UTF-8.")
            except Exception as e:
                print(f"Error processing {full_path}: {e}")
Original Objective:
The original objective was to identify the encoding of text files and convert them to UTF-8 if they were not already in that encoding.
Summary of Adjustments:
Added functionality to extract and process text files from a ZIP archive. Files are extracted to a temporary directory (temp_dir).
Before extracting and converting a file, the script checks whether the file already exists in temp_dir and is already UTF-8 encoded. If so, the file is skipped.
To keep memory usage low during detection, files are fed to the detector in 1024-byte (1 KB) chunks rather than read all at once.
Added error handling to catch and report any issues during the file reading and writing process.
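As a side note, the skip-if-already-UTF-8 check can also be done without chardet by attempting a strict UTF-8 decode. A minimal self-contained sketch (demo.zip, demo.txt, and ok.txt are made-up names; the archive is built on the fly so the example runs on its own):

```python
import zipfile
from pathlib import Path

def is_utf8(data: bytes) -> bool:
    # A strict decode succeeds only for valid UTF-8 byte sequences.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Build a tiny demo archive: one cp1252 file, one UTF-8 file.
Path("demo.txt").write_bytes("na\u00efve".encode("cp1252"))
with zipfile.ZipFile("demo.zip", "w") as zf:
    zf.write("demo.txt")
    zf.writestr("ok.txt", "plain UTF-8 text")

with zipfile.ZipFile("demo.zip") as zf:
    for name in zf.namelist():
        verdict = "already UTF-8" if is_utf8(zf.read(name)) else "needs conversion"
        print(name, verdict)
```

Note that pure-ASCII files pass the check too, which is fine here since ASCII is a subset of UTF-8.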
Upvotes: 0
Reputation: 41
I used two steps to solve this problem.
import os, re, codecs
import chardet
First, use the chardet package to identify the encoding of each text file:
for text in os.listdir(path):
    txtPATH = os.path.join(path, text)
    with open(txtPATH, 'rb') as f:
        data = f.read()
    f_charInfo = chardet.detect(data)
    coding = str(f_charInfo['encoding'])
    print(coding)
Second, if the detected encoding is not UTF-8, rewrite the text back to the same path with UTF-8 encoding:
    # (still inside the loop from the first step)
    if not re.match(r'utf-?8$', coding, re.IGNORECASE):
        print(txtPATH)
        print(coding)
        with codecs.open(txtPATH, "r", coding) as sourceFile:
            contents = sourceFile.read()
        with codecs.open(txtPATH, "w", "utf-8") as targetFile:
            targetFile.write(contents)
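If installing chardet is not an option, a stdlib-only fallback is to try a short list of candidate encodings in order. The candidate list below is an assumption based on the file types listed in the question, and demo.txt is a made-up name:

```python
from pathlib import Path

# Assumed priority order: BOM-aware UTF-8 first, then a Windows fallback.
# latin-1 maps every byte, so it never fails and acts as the last resort.
CANDIDATES = ["utf-8-sig", "cp1252", "latin-1"]

def read_any(path):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    raw = Path(path).read_bytes()
    for enc in CANDIDATES:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue

# Example: a Latin-1/cp1252 file is decoded, then rewritten as UTF-8.
Path("demo.txt").write_bytes("caf\u00e9".encode("latin-1"))
text, enc = read_any("demo.txt")
Path("demo.txt").write_text(text, encoding="utf-8")
print(enc, text)
```

This is cruder than chardet (for instance, the UTF-16 files in the question's listing would need a BOM check first), but it avoids the dependency.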
Upvotes: 3