Akbar Hussein

Reputation: 360

Read file lines and merge them based on their length

EDIT: Here is a text file: https://www.gutenberg.org/files/9830/9830-0.txt

I have a file test_file.txt whose lines vary in length (number of words). I want to read each line and check its length: if it has at least a minimum number of words (say 20), I append it to a list named container (container = []). Otherwise, I read the next line and merge it with the current one, repeating until the merged line reaches the minimum length, and then append the merged line to container. I need to do this for every line in the file.

Here is my code. It works until the last two lines, which it ignores.

# Creating a generator to load file lines, one by one:

def gen_file_reader(file_path):
    with open(file_path, encoding='utf-8') as file:
        for line in file.readlines():
            yield line

container = [] # List that will contain the results
lines = gen_file_reader('test_file.txt') # Calling the generator function


x = ""
for line in lines:
    while len(x.split()) < 20:
        x = x + line
        break
    else:
        container.append(x)
        x = ""
        container.append(line)

I noticed my code doesn't work for the last two lines in the file, maybe because of the break keyword in the while statement ... There could be other bugs I am not aware of!

EDIT: The end result for the example file (assuming we got rid of blank and empty lines) would look like this for the first 4 items in the list container:

["Project Gutenberg's The Beautiful and Damned, by F. Scott Fitzgerald This eBook is for the use of anyone anywhere at no cost and with",
 'almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org Title: The Beautiful and Damned Author: F. Scott Fitzgerald Release Date: October 22, 2003 [EBook #9830]',
 'Last updated: January 29, 2020 Language: English Character set encoding: UTF-8 *** START OF THIS PROJECT GUTENBERG EBOOK THE BEAUTIFUL AND DAMNED ***']

Upvotes: 0

Views: 114

Answers (1)

Lanbao

Reputation: 676

In your logic, if the trailing lines cannot be concatenated to reach 20 or more words, they are never added to the container. I also think it's better to do the merge logic directly in the generator:

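def gen_file_reader(file_path):
    with open(file_path, encoding='utf-8') as file:
        for line in file:
            try:
                # Keep pulling lines until the merged line has at least 20 words
                while len(line.split()) < 20:
                    line += next(file)
                yield line
            except StopIteration:
                # End of file: yield whatever was collected, even if it is short
                yield line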


lines = gen_file_reader('test_file.txt')  # Calling the generator function
print(list(lines))

Here is my test_file.txt:

my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao
how are you my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao
my name is Cn-LanBao my name is Cn-LanBao
my name is Cn-LanBao my name is Cn-LanBao
my name is how are you
my name is how are you
my name is Cn-LanBao
my name is how are you

And the output:

['my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao\nhow are you my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao my name is Cn-LanBao\n', 'my name is Cn-LanBao my name is Cn-LanBao\nmy name is Cn-LanBao my name is Cn-LanBao\nmy name is how are you\n', 'my name is how are you\nmy name is Cn-LanBao\nmy name is how are you\n']
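Note that the merged strings above still contain the original "\n" characters, while the expected result in the question joins the pieces with spaces and drops blank lines. A minimal sketch of a variant that does that is below. The name merge_short_lines and the min_words parameter are made up for illustration and are not part of the code above; it accepts any iterable of lines, so an open file object works as well as a list.

```python
def merge_short_lines(lines, min_words=20):
    """Yield merged lines of at least min_words words.

    The last item may be shorter if the input runs out first.
    """
    parts = []   # stripped lines collected for the current merged line
    count = 0    # number of words collected so far
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank and whitespace-only lines
        parts.append(text)
        count += len(text.split())
        if count >= min_words:
            yield " ".join(parts)
            parts, count = [], 0
    if parts:
        # Flush the leftover so the trailing short lines are not lost
        yield " ".join(parts)


# Small in-memory demo with a low threshold
sample = ["a b c\n", "\n", "d e\n", "f g h i\n", "j\n", "k l\n"]
container = list(merge_short_lines(sample, min_words=5))
print(container)
```

To run it on the file from the question, pass the open file object directly, for example container = list(merge_short_lines(file)) inside a with open('test_file.txt', encoding='utf-8') as file: block. The flush after the loop is what keeps the final short lines, which is the step the loop in the question is missing.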

Upvotes: 1
