Reputation: 79
I have several "UTF-16-LE with BOM" encoded files that are roughly 10–50 MB in size. I'm trying to split each of them into smaller files no bigger than 1 MB (e.g., "File1.txt" into "File1-part-0.txt", "File1-part-1.txt", and so on). After running my code, only the first split file is less than 1 MB, whereas the other files are all 1,994 KB (see my results below).
My splitting process reads each row in the file, calculates its size in bytes, and concatenates it onto the current chunk if the result doesn't exceed 1 MB. If adding the row would exceed the 1 MB limit, it writes the current chunk to a split file and then starts a new chunk with that row. I split on row boundaries because the data is in another language and needs to be translated; splitting mid-word or mid-sentence could make the translator produce incorrect results.
Why is the byte size of the total_bytes variable larger than the size of the file? It seems like some of the data gets lost during this process.
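For reference, I'm measuring chunk sizes with len(str.encode(...)). A minimal example of what I expect that to do, using a made-up string rather than my real data:

```python
# Each character in the Basic Multilingual Plane encodes to exactly
# 2 bytes in UTF-16-LE, and encoding a str with the "utf-16-le" codec
# does not prepend a BOM.
sample = "abc\n"  # made-up example string, 4 characters
encoded = sample.encode("utf-16-le")
print(len(encoded))  # 8
```

So for my data I expect the encoded byte count of a chunk to match what ends up on disk for that chunk.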
Below is the current code I'm running.
#!/usr/bin/python3
import os, sys

MAXIMUM_FILESIZE = 1000000  # Maximum allowed filesize in bytes (1 MB)
FILETYPE = ".txt"  # Filetype (Example: ".txt")
FILE_ENCODING = "utf-16-le"  # File encoding (Example: "utf-16-le" or "utf-8")

import_folder_path = r"path\to\import\folder"  # Import folder path
export_folder_path = r"path\to\export\folder"  # Export folder path

# Creates list of files in folder
file_list = os.listdir(import_folder_path)

# Reads data from each file in folder
for file in file_list:
    # Creates absolute filepath for each file
    import_filepath = import_folder_path + "\\" + file
    # Reads metadata from file
    file_metadata = os.stat(import_filepath)
    current_filesize = file_metadata.st_size
    with open(import_filepath, "r", encoding=FILE_ENCODING) as input_file:
        # Reads every line in file
        lines = input_file.readlines()
    i = 0  # Initializer for split file counter
    # Creates new name for smaller files
    temp_file = file.rsplit(FILETYPE)
    new_filename = temp_file[0] + "-part-" + str(i) + FILETYPE
    export_filepath = export_folder_path + "\\" + new_filename
    total_bytes = ""  # Initializer for reading data in bytes
    for line in lines:
        # Concatenates data and checks byte size
        if (int(len(total_bytes.encode(FILE_ENCODING))) + int(len(line.encode(FILE_ENCODING))) <= MAXIMUM_FILESIZE):
            total_bytes += line
        # Writes concatenated data to split file
        elif (int(len(total_bytes.encode(FILE_ENCODING))) + int(len(line.encode(FILE_ENCODING))) > MAXIMUM_FILESIZE):
            print(f"Size: \"{len(total_bytes.encode(FILE_ENCODING))}\"")
            with open(export_filepath, 'w', encoding=FILE_ENCODING) as output_file:
                output_file.write(total_bytes)
            i += 1  # Increments split file counter
            # Creates new filename for next split file
            temp_file = file.rsplit(FILETYPE)
            new_filename = temp_file[0] + "-part-" + str(i) + FILETYPE
            export_filepath = export_folder_path + "\\" + new_filename
            # Adds current row being read to overwrite variable data. This is
            # needed to prevent the current row of data being read from missing.
            total_bytes = line
        else:
            print("Error")
Upvotes: 0
Views: 351