How do I remove continuing repeated words in Python

Question

My task is fairly complicated, but I have tried to narrow it down. I was given repeated lines of text and need to extract sentences from them so that the result reads naturally to a human.

Below is an example input:

that are at work there. And when their work has been
that are at work there. And when their work has been done
that are at work there. And when their work has been done.
And when their work has been done. We're
And when their work has been done. We're going
And when their work has been done. We're going to
And when their work has been done. We're going to evaluate
And when their work has been done. We're going to evaluate both
And when their work has been done. We're going to evaluate both their
And when their work has been done. We're going to evaluate both their risk
We're going to evaluate both their risk their
We're going to evaluate both their risk their work
We're going to evaluate both their risk their work.
their work. And
their work. And the
their work. And the results
their work. And the results of
their work. And the results of that
their work. And the results of that work
their work. And the results of that work.
And the results of that work. And
And the results of that work. And to
And the results of that work. And to make
And the results of that work. And to make statements
And the results of that work. And to make statements about
And the results of that work. And to make statements about it
And the results of that work. And to make statements about it at
And the results of that work. And to make statements about it at that
And to make statements about it at that time
And to make statements about it at that time.
time. OK
time. OK Thank
time. OK Thank you
time. OK Thank you ladies
time. OK Thank you ladies and
time. OK Thank you ladies and gentlemen
time. OK Thank you ladies and gentlemen.
OK Thank you ladies and gentlemen. This
OK Thank you ladies and gentlemen. This press
OK Thank you ladies and gentlemen. This press conference
OK Thank you ladies and gentlemen. This press conference is
OK Thank you ladies and gentlemen. This press conference is over
OK Thank you ladies and gentlemen. This press conference is over.
This press conference is over. Thank
This press conference is over. Thank you
This press conference is over. Thank you.
Thank you. If

And this is what I have arrived at, my CURRENT OUTPUT:

We have fought for that.
Some of us.
Twenty thirty and forty years and.
Other focus of our emphasis today has to do with the matters that we
to do with the matters that we presented in the letter this does not
presented in the letter this does not mean that these are the only things
mean that these are the only things that we are concerned about.
But these are the matters that we want to put on the table.
Far.
The Honorable President George Bush.
That easiest way to start a dialogue given that it's something you've worked
given that it's something you've worked so hard on voting in registration.
Missile fired a rational Black Caucus for example so that they're frustrated
for example so that they're frustrated at the Bush administration has even
at the Bush administration has even suggested that that's something that
suggested that that's something that should be an issue for them.
That just.
It's been nothing has been done.
Let me say again.
It is a matter about which we have been concerned about which we continue to be
concerned about which we continue to be concerned.
And we're looking very carefully.
At the work of the Civil Rights Commission.
In Florida and the other legal entities that are at work there.
And when their work has been done.
We're going to evaluate both their risk their work.
And the results of that work.
And to make statements about it at that time.
OK Thank you ladies and gentlemen.
This press conference is over.

But you can still see that there are repeated lines such as

Other focus of our emphasis today has to do with the matters that we
to do with the matters that we presented in the letter this does not
presented in the letter this does not mean that these are the only things
mean that these are the only things that we are concerned about.

where each line begins with words from the end of the previous line.

Below is the code I used.

import os
directory = './input'
for filename in os.listdir(directory):
        print("Processing {}".format(filename))

        with open("./input/"+filename) as inputFile:
                data = inputFile.readlines()

        sentences = []
        for line in data:
                for s in line.split("."):
                        sentences.append(s.strip() + ".")
                sentences[-1] = sentences[-1][:-1]

        longest = []
        for s in sentences:
                for i, s2 in enumerate(longest):
                        if s2 and (s2.startswith(s) or s2.endswith(s)): # longer version already in
                                break
                        elif s2 and (s.startswith(s2) or s.endswith(s2)): # new sentence is longer
                                longest[i] = s
                else: # nobreak
                        longest.append(s)

        unique = []
        last = None
        for s in longest:
                if s != last:
                        unique.append(s)
                        last = s

        new_data = ""
        for s in unique:
                new_data = new_data + s + "\n"

        with open("./output/"+filename, "w") as text_file:
                text_file.write(new_data)

Now, given the new output in "./output/"+filename (shown above as CURRENT OUTPUT), how can I get rid of the repeated lines?

I am continuing from this code now:

import os
directory = './new_input'
for filename in os.listdir(directory):
        print("Processing {}".format(filename))

        with open("./new_input/"+filename) as inputFile:
                data = inputFile.readlines()

PLEASE USE MY CURRENT OUTPUT AS AN INPUT

Bill Huang · Accepted Answer

I re-implemented your steps as 3 blocks for better maintainability and clarity. The steps are explained as follows.

1. Loader-discarder: As one can easily see, the previously loaded "substrings" (lines that are only a prefix of the line that follows) are not useful at all. They can be discarded while the inevitable and costly line-by-line loading happens inside the open() block. As a result, every line retained after this step carries information.
2. Repeating-substring finder: A simple brute-force search for the length of the repeated substring, i.e. the overlap between the end of one string and the beginning of the next. It takes 2 strings as arguments and returns 0 if no overlap is found.
3. Repeating-substring remover: Apply the function from step 2 to your data. Separating steps 2 and 3 makes the responsibilities of the string-searching logic and the data-chopping logic clearer and more traceable.

The rest should be self-explanatory. Try to divide your logic into small, independently testable units, each serving a single logical purpose, instead of mixing them all together. I didn't reuse your code because I feel its logic is fairly complex and is unlikely to be maintainable without a proper refactoring.
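To illustrate what "independently testable" can look like, here is a minimal sketch of the discarding step written as a pure function over an in-memory list, so it can be checked with plain assertions and no file on disk. The name keep_final_versions is hypothetical and is not part of the code below.

```python
def keep_final_versions(lines):
    """Keep only the last, most complete version of each growing line.

    Hypothetical helper for illustration: same idea as the
    loader-discarder step, but without any file handling.
    """
    kept = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        if kept and text.startswith(kept[-1]):
            kept[-1] = text  # the new line extends the previous one
        else:
            kept.append(text)
    return kept


sample = [
    "their work. And",
    "their work. And the",
    "their work. And the results",
    "And the results of that work. And",
]
result = keep_final_versions(sample)
assert result == [
    "their work. And the results",
    "And the results of that work. And",
]
```

Because the function never touches the file system, the same check can be repeated on any awkward input (blank lines, a single line, no lines at all) before wiring it into the file loop.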

Code

def load_and_discard(file_path):
    """
    Load and discard previous substrings.

    Args:
        file_path (PathLike): path to data file

    Returns:
        list[str]
    """

    data = []
    with open(file_path) as f:  # use the parameter, not the global infile_path
        for line in f:
            st = line.strip()
            # "data" may still be empty if the file starts with blank lines
            if data and st.startswith(data[-1]):
                data[-1] = st
            elif len(st) > 0:  # guard against empty string
                data.append(st)
    return data


def find_lebms(s1, s2):
    """
    Brute-force search for the longest-end-begin-matching-substring (LEBMS).

    Args:
        s1 (str): 1st stripped str (match the end)
        s2 (str): 2nd stripped str (match the begin)

    Returns:
        int: length of LEBMS
    """

    # search up to this length
    n1 = min(len(s1), len(s2))

    # try the longest candidate first so that the longest overlap is returned
    for i in range(n1, 0, -1):
        if s1[-i:] == s2[:i]:
            return i
    return 0


def remove_repeated_substr(data):
    """
    Generate strings (in-place) ready for concatenation by
    removing the repeated substring in the first string.

    Args:
        data (list[str]): list of strings

    Returns:
        None
    """

    n0 = len(data)
    for i, st in enumerate(data):

        # guard: no chopping for the last line
        if i == n0 - 1:
            break

        # chop the current row
        n = find_lebms(st, data[i + 1])
        if n > 0:  # guard against n = 0
            data[i] = st[:-n]

infile_path = "/mnt/ramdisk/in.csv"

data = load_and_discard(infile_path)
remove_repeated_substr(data)

# (optional) prevent un-spaced ending periods
for i, st in enumerate(data):
    if st.endswith("."):  # also safe if a line was chopped down to ""
        data[i] += " "

ans = "".join(data)

Output:

from pprint import pprint

pprint(ans)

# The output text reads smoothly, at least for the sample data provided.
("that are at work there. And when their work has been done. We're going to "
 'evaluate both their risk their work. And the results of that work. And to '
 'make statements about it at that time. OK Thank you ladies and gentlemen. '
 'This press conference is over. Thank you. If')
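Since the question asks for the CURRENT OUTPUT to be used as input, the same overlap idea can also be sketched at the word level and applied directly to those lines. This is only a sketch under stated assumptions: the function names and the min_words threshold are hypothetical, and the threshold of 2 is an assumption meant to avoid merging lines that merely share a single word.

```python
def word_overlap(prev_words, next_words):
    """Number of words in the longest suffix of prev_words
    that is also a prefix of next_words (0 if none)."""
    limit = min(len(prev_words), len(next_words))
    for n in range(limit, 0, -1):
        if prev_words[-n:] == next_words[:n]:
            return n
    return 0


def merge_overlapping_lines(lines, min_words=2):
    """Join consecutive lines whose end/beginning overlap by at least
    min_words words (assumed threshold); other lines stay separate."""
    merged = []  # output lines, each stored as a list of words
    prev = None  # words of the previous input line
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        n = word_overlap(prev, words) if prev else 0
        if n >= min_words:
            merged[-1].extend(words[n:])  # append only the new words
        else:
            merged.append(list(words))
        prev = words
    return [" ".join(ws) for ws in merged]


sample = [
    "Other focus of our emphasis today has to do with the matters that we",
    "to do with the matters that we presented in the letter this does not",
    "presented in the letter this does not mean that these are the only things",
    "mean that these are the only things that we are concerned about.",
    "But these are the matters that we want to put on the table.",
]
result = merge_overlapping_lines(sample)
assert result == [
    "Other focus of our emphasis today has to do with the matters that we "
    "presented in the letter this does not mean that these are the only "
    "things that we are concerned about.",
    "But these are the matters that we want to put on the table.",
]
```

Comparing whole words rather than characters avoids accidental matches on a single trailing letter. Lines that do not overlap are kept as separate output lines, which matches the one-sentence-per-line layout of the CURRENT OUTPUT.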
