Andy Garcia

Reputation: 79

Issue with splitting text file into smaller files by rows and bytes

I have several UTF-16-LE-with-BOM encoded files that are roughly 10 to 50 MB in size. I'm trying to split each file into smaller files no bigger than 1 MB (e.g., "File1.txt" into "File1-part-0.txt", "File1-part-1.txt", and so on). After running my code, only the first output file is less than 1 MB, whereas the other files are each 1,994 KB (see my results below).

[Screenshot of the output folder showing the resulting split files and their sizes]

My process for splitting these files is: read each row of the file, calculate its size in bytes, and concatenate it onto the accumulated data as long as the total doesn't exceed 1 MB. If adding the row would push the total over 1 MB, the accumulated data is written to a split file, and the variable is then overwritten with that row. I split on row boundaries because the data is in a different language and needs to be translated; splitting in the middle of a word or sentence may cause the translator to produce incorrect results.

Why is the byte size I compute for total_bytes larger than the size of the resulting file on disk suggests it should be? It also seems like some of the data gets lost during this process.
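To rule out a misunderstanding on my part about how the codecs count bytes, I ran this quick check (the sample string is just an illustration): "utf-16-le" encodes exactly two bytes per character with no BOM, while "utf-16" prepends a two-byte BOM.

```python
# Compare character count vs. encoded byte count for UTF-16 variants.
s = "abc\n"

print(len(s))                      # 4 characters
print(len(s.encode("utf-16-le")))  # 8 bytes: 2 bytes per character, no BOM
print(len(s.encode("utf-16")))     # 10 bytes: 2-byte BOM + 8 bytes of text
```

So measuring with "utf-16-le" should already count two bytes per character, which is why the roughly doubled output size confuses me.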

Below is the current code I'm running.

#!/usr/bin/python3
import os, sys

MAXIMUM_FILESIZE = 1000000    #Maximum allowed filesize in bytes (1MB)
FILETYPE = ".txt"             #Filetypes (Example: ".txt")
FILE_ENCODING = "utf-16-le"   #File encoding (Example: "utf-16-le" or "utf-8")    
import_folder_path = r"path\to\import\folder"     #Import folder path
export_folder_path = r"path\to\export\folder"   #Export folder path

#Creates list of files in folder
file_list = os.listdir(import_folder_path)

#Reads data from each file in folder 
for file in file_list:

    #Creates absolute filepath for each file
    import_filepath = import_folder_path + "\\" + file

    #Reads metadata from file
    file_metadata = os.stat(import_filepath)
    current_filesize = file_metadata.st_size

    with open(import_filepath, "r", encoding=FILE_ENCODING) as input_file:

        #Reads every line in file
        lines = input_file.readlines()

    i = 0    #Initializer for split file counter

    #Creates new name for smaller files 
    temp_file = file.rsplit(FILETYPE)
    new_filename = temp_file[0] + "-part-" + str(i) + FILETYPE
    export_filepath = export_folder_path + "\\" + new_filename

    total_bytes = ""   #Initializer for reading data in bytes

    for line in lines:
    
        #Concatenates data and checks byte size
        if (int(len(total_bytes.encode(FILE_ENCODING))) + int(len(line.encode(FILE_ENCODING))) <= MAXIMUM_FILESIZE):

            total_bytes += line

        #Writes concatenated data to split file 
        elif (int(len(total_bytes.encode(FILE_ENCODING))) + int(len(line.encode(FILE_ENCODING))) > MAXIMUM_FILESIZE):

            print(f"Size: \"{len(total_bytes.encode(FILE_ENCODING))}\"")
            with open(export_filepath, 'w', encoding=FILE_ENCODING) as output_file:
                output_file.write(total_bytes)         
        
            i += 1    #Increments split file counter

            #Creates new filename for next split file 
            temp_file = file.rsplit(FILETYPE)
            new_filename = temp_file[0] + "-part-" + str(i) + FILETYPE
            export_filepath = export_folder_path + "\\" + new_filename

            #Adds current row being read to overwrite variable data. This is 
            #needed to prevent the current row of data being read from missing.
            total_bytes = line    

        else:
            print("Error")

Upvotes: 0

Views: 351

Answers (0)
