Andrei Smirnov

Reputation: 93

How do I remove continuing repeated words in Python

My task is fairly complicated, but I have tried to narrow it down. I was given repeated lines of text and need to extract sentences from them so that the result makes sense to a human reader.

Below is an example input:

that are at work there. And when their work has been
that are at work there. And when their work has been done
that are at work there. And when their work has been done.
And when their work has been done. We're
And when their work has been done. We're going
And when their work has been done. We're going to
And when their work has been done. We're going to evaluate
And when their work has been done. We're going to evaluate both
And when their work has been done. We're going to evaluate both their
And when their work has been done. We're going to evaluate both their risk
We're going to evaluate both their risk their
We're going to evaluate both their risk their work
We're going to evaluate both their risk their work.
their work. And
their work. And the
their work. And the results
their work. And the results of
their work. And the results of that
their work. And the results of that work
their work. And the results of that work.
And the results of that work. And
And the results of that work. And to
And the results of that work. And to make
And the results of that work. And to make statements
And the results of that work. And to make statements about
And the results of that work. And to make statements about it
And the results of that work. And to make statements about it at
And the results of that work. And to make statements about it at that
And to make statements about it at that time
And to make statements about it at that time.
time. OK
time. OK Thank
time. OK Thank you
time. OK Thank you ladies
time. OK Thank you ladies and
time. OK Thank you ladies and gentlemen
time. OK Thank you ladies and gentlemen.
OK Thank you ladies and gentlemen. This
OK Thank you ladies and gentlemen. This press
OK Thank you ladies and gentlemen. This press conference
OK Thank you ladies and gentlemen. This press conference is
OK Thank you ladies and gentlemen. This press conference is over
OK Thank you ladies and gentlemen. This press conference is over.
This press conference is over. Thank
This press conference is over. Thank you
This press conference is over. Thank you.
Thank you. If

This is what I have got so far, my CURRENT OUTPUT:

We have fought for that.
Some of us.
Twenty thirty and forty years and.
Other focus of our emphasis today has to do with the matters that we
to do with the matters that we presented in the letter this does not
presented in the letter this does not mean that these are the only things
mean that these are the only things that we are concerned about.
But these are the matters that we want to put on the table.
Far.
The Honorable President George Bush.
That easiest way to start a dialogue given that it's something you've worked
given that it's something you've worked so hard on voting in registration.
Missile fired a rational Black Caucus for example so that they're frustrated
for example so that they're frustrated at the Bush administration has even
at the Bush administration has even suggested that that's something that
suggested that that's something that should be an issue for them.
That just.
It's been nothing has been done.
Let me say again.
It is a matter about which we have been concerned about which we continue to be
concerned about which we continue to be concerned.
And we're looking very carefully.
At the work of the Civil Rights Commission.
In Florida and the other legal entities that are at work there.
And when their work has been done.
We're going to evaluate both their risk their work.
And the results of that work.
And to make statements about it at that time.
OK Thank you ladies and gentlemen.
This press conference is over.

But you can still see that there are repeated lines such as

Other focus of our emphasis today has to do with the matters that we
to do with the matters that we presented in the letter this does not
presented in the letter this does not mean that these are the only things
mean that these are the only things that we are concerned about.

where each line begins with words from the end of the previous line.

Below is the code I used.

import os
directory = './input'
for filename in os.listdir(directory):
        print("Processing {}".format(filename))

        with open("./input/"+filename) as inputFile:
                data = inputFile.readlines()

        sentences = []
        for line in data:
                for s in line.split("."):
                        sentences.append(s.strip() + ".")
                sentences[-1] = sentences[-1][:-1]

        longest = []
        for s in sentences:
                for i, s2 in enumerate(longest):
                        if s2 and (s2.startswith(s) or s2.endswith(s)): # longer version already in
                                break
                        elif s2 and (s.startswith(s2) or s.endswith(s2)): # new sentence is longer
                                longest[i] = s
                else: # nobreak
                        longest.append(s)

        unique = []
        last = None
        for s in longest:
                if s != last:
                        unique.append(s)
                        last = s

        new_data = ""
        for s in unique:
                new_data = new_data + s + "\n"

        with open("./output/"+filename, "w") as text_file:
                text_file.write(new_data)

Now, given the new output file ("./output/" + filename), whose contents I showed above as the CURRENT OUTPUT, how can I get rid of the repeated lines?

I am continuing from this code now:

import os
directory = './new_input'
for filename in os.listdir(directory):
        print("Processing {}".format(filename))

        with open("./new_input/"+filename) as inputFile:
                data = inputFile.readlines()

PLEASE USE MY CURRENT OUTPUT AS AN INPUT

Upvotes: 1

Views: 135

Answers (2)

user13843220

Reputation:

This is the final pure regex solution that covers all your bases
over all the different posts you've made about the same issues.
Each of these posts presents the data in a different form,
but they all boil down to this solution.

It's complex and hard to explain. I recommend you just use it without trying to alter it.

Just a note: regular expressions cannot parse natural spoken language,
so they cannot tell where a line break belongs if you are trying to format the text.
Just do the best you can, and handle the formatting in a separate step.

This method uses re.sub() in a loop until the resulting string equals the
previous string. That's the way the regex is implemented.
The good news is you don't have to come up with all these contortions
of functions that really don't work very well.
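
The "repeat re.sub() until nothing changes" loop can be sketched on its own, independent of the pattern. This is only a minimal illustration: the helper name sub_until_stable and the max_passes safety cap are assumptions made for this sketch, and the toy pattern is not the regex from this answer.

```python
import re


def sub_until_stable(pattern, repl, text, max_passes=1000):
    """Apply re.sub repeatedly until a pass leaves the text unchanged.

    max_passes is a hypothetical safety cap so that a replacement which
    keeps changing the text cannot loop forever.
    """
    compiled = re.compile(pattern)
    for _ in range(max_passes):
        new_text = compiled.sub(repl, text)
        if new_text == text:  # fixed point reached
            return text
        text = new_text
    raise RuntimeError("substitution did not stabilise")


# Toy pattern: each pass strips one layer of empty parentheses,
# so a single re.sub call would not be enough here.
print(sub_until_stable(r"\(\)", "", "a((()))b"))
```

The same helper could be called with the Find and Replace strings given below in place of the toy pattern.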

Find: r"(?m)(?:((?:^(.*\S.*$)(?=[\S\s]*?^\2[ \t]*\S)\s*)*)(?=^(\2|(.*\S.*))$(?![\S\s]*?^\3[ \t]*\S)))(?:^(.+?)(.+?)$(?:\s*^\s*\6(.+?)$)+)?"
Replace: r"\5\6\7"

The regex does a great deal on many levels, and it's very complex. I've added some comments, but if you have any questions, let me know.

 (?m)                          # Multi-line mode
 
 # Here accumulate the longest sub-duplicate(s) emanating from BOL,
 # leave the last longest alone.
 (?:
    (                             # (1 start)
       (?:
          ^ 
          (                             # (2 start)
             .* \S .* $ 
          )                             # (2 end)
          (?=
             [\S\s]*? 
             ^ \2 [ \t]* \S 
          )
          \s* 
       )*
    )                             # (1 end)
    (?=
       ^  
       (                             # (3 start)
          \2 
        | 
          (                             # (4 start)
             .* \S .*  
          )                             # (4 end)
       )                             # (3 end)
       $           
       (?!
          [\S\s]*? 
          ^ \3 [ \t]* \S 
       )
    )
 )
 
 # Here get the overlap duplicate to splice
 (?:
    ^ 
    ( .+? )                       # (5)
    ( .+? )                       # (6)
    $ 
    (?:
       \s* 
       ^ 
       \s* 
       \6 
       ( .+? )                       # (7)
       $ 
    )+
 )?

Python samples of the different data looks

 # -------------------------------------
 # Python sample: Original Input
 # -------------------------------------

>>> import re
>>>
>>> input = '''
... This reminder to our viewers that on Saturday
... This reminder to our viewers that on Saturday at
... This reminder to our viewers that on Saturday at eleven
... This reminder to our viewers that on Saturday at eleven A.M.
... This reminder to our viewers that on Saturday at eleven A.M. Eastern
... This reminder to our viewers that on Saturday at eleven A.M. Eastern Time
... This reminder to our viewers that on Saturday at eleven A.M. Eastern Time in
... Saturday at eleven A.M. Eastern Time in the
... Saturday at eleven A.M. Eastern Time in the morning
... Saturday at eleven A.M. Eastern Time in the morning Pacific
... Saturday at eleven A.M. Eastern Time in the morning Pacific.
... the morning Pacific. We'l
... the morning Pacific. We'l bring
... the morning Pacific. We'l bring you
... the morning Pacific. We'l bring you live
... the morning Pacific. We'l bring you live coverage
... the morning Pacific. We'l bring you live coverage of
... the morning Pacific. We'l bring you live coverage of the
... We'l bring you live coverage of the conference
... We'l bring you live coverage of the conference.
... conference. Focusing
... conference. Focusing on
... conference. Focusing on the
... conference. Focusing on the separation
... conference. Focusing on the separation of
... conference. Focusing on the separation of powers
... conference. Focusing on the separation of powers.
... Focusing on the separation of powers. Sponsored
... Focusing on the separation of powers. Sponsored by
... Focusing on the separation of powers. Sponsored by the
... Focusing on the separation of powers. Sponsored by the Federalist
... Focusing on the separation of powers. Sponsored by the Federalist Society
... Focusing on the separation of powers. Sponsored by the Federalist Society.
... Sponsored by the Federalist Society. Coming
... Sponsored by the Federalist Society. Coming up
... Sponsored by the Federalist Society. Coming up after
... Sponsored by the Federalist Society. Coming up after this
... Sponsored by the Federalist Society. Coming up after this short
... Sponsored by the Federalist Society. Coming up after this short break
... Sponsored by the Federalist Society. Coming up after this short break.
... Coming up after this short break. A
... Coming up after this short break. A spech
... Coming up after this short break. A spech by
... Coming up after this short break. A spech by the
... Coming up after this short break. A spech by the president
... Coming up after this short break. A spech by the president of
... Coming up after this short break. A spech by the president of the
... A spech by the president of the Southern
... A spech by the president of the Southern Christian
... A spech by the president of the Southern Christian Leadership
... Southern Christian Leadership Conference
... Southern Christian Leadership Conference.
... Conference. Joseph
... Conference. Joseph Lowery
... Conference. Joseph Lowery of
... Conference. Joseph Lowery of an
... Conference. Joseph Lowery of an American
... '''
>>>
>>> input_new = ""
>>> isOk = True
>>> while isOk :
...     input_new = re.sub( r"(?m)(?:((?:^(.*\S.*$)(?=[\S\s]*?^\2[ \t]*\S)\s*)*)(?=^(\2|(.*\S.*))$(?![\S\s]*?^\3[ \t]*\S)))(?:^(.+?)(.+?)$(?:\s*^\s*\6(.+?)$)+)?", r"\5\6\7", input)
...     if input_new != input:
...         input = input_new
...     else:
...         isOk = False
...
>>> print( "\r\n" + input + "\r\n" )

This reminder to our viewers that on Saturday at eleven A.M. Eastern Time in the morning Pacific. We'l bring you live coverage of the conference. Focusing on the separation of powers. Sponsored by the Federalist Society. Coming up after this short break. A spech by the president of the Southern Christian Leadership Conference. Joseph Lowery of an American

>>>


 # -------------------------------------
 # Python sample: Alternative Input
 # -------------------------------------

>>> import re
>>>
>>> input = '''
... You can watch a representative.
... Twenty three zero seven of the Rayburn Office Building.
... Washington D.C. each week. C.-SPAN
... Washington D.C. each week. C.-SPAN breaks
... Washington D.C. each week. C.-SPAN breaks from
... Washington D.C. each week. C.-SPAN breaks from its
... Washington D.C. each week. C.-SPAN breaks from its public
... Washington D.C. each week. C.-SPAN breaks from its public affairs
... C.-SPAN breaks from its public affairs programming
... C.-SPAN breaks from its public affairs programming to
... C.-SPAN breaks from its public affairs programming to give
... C.-SPAN breaks from its public affairs programming to give the
... C.-SPAN breaks from its public affairs programming to give the viewer
... C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
... Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
... Six thirty P.M. Eastern three thirty P.M. Pacific Time.
... Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
... One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN
... P.M. Pacific Time. As always C.-SPAN scheduled
... P.M. Pacific Time. As always C.-SPAN scheduled programming
... As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
... Going on this election year.
... Covering every issue in the campaign calendar.
... The calendar list the network's plans for campaign.
... From now through election day.
... In addition to election coverage.
... Other major events are cameras record.
... Call toll free one eight hundred three four six. Her it to order the C.-SPAN
... four six. Her it to order the C.-SPAN update for
... Her it to order the C.-SPAN update for twenty four dollars.
... You can use your credit card or will be glad to send you a bill.
... Call one eight hundred three four six eight hundred.
... And you'll receive fifty issues of the C.-SPAN update.
... If you order an update subscription now.
... The receive a free gift. The C.-SPAN road to the White House
... The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
... Attractively depicts the spans grassroots approach to the campaign called.
... '''
>>>
>>> input_new = ""
>>> isOk = True
>>> while isOk :
...     input_new = re.sub( r"(?m)(?:((?:^(.*\S.*$)(?=[\S\s]*?^\2[ \t]*\S)\s*)*)(?=^(\2|(.*\S.*))$(?![\S\s]*?^\3[ \t]*\S)))(?:^(.+?)(.+?)$(?:\s*^\s*\6(.+?)$)+)?", r"\5\6\7", input)
...     if input_new != input:
...         input = input_new
...     else:
...         isOk = False
...
>>> print( "\r\n" + input + "\r\n" )

You can watch a representative.
Twenty three zero seven of the Rayburn Office Building.
Washington D.C. each week. C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
Six thirty P.M. Eastern three thirty P.M. Pacific Time.
Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
Going on this election year.
Covering every issue in the campaign calendar.
The calendar list the network's plans for campaign.
From now through election day.
In addition to election coverage.
Other major events are cameras record.
Call toll free one eight hundred three four six. Her it to order the C.-SPAN update for twenty four dollars.
You can use your credit card or will be glad to send you a bill.
Call one eight hundred three four six eight hundred.
And you'll receive fifty issues of the C.-SPAN update.
If you order an update subscription now.
The receive a free gift. The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
Attractively depicts the spans grassroots approach to the campaign called.

>>>

 # -------------------------------------
 # Python sample: Current Output
 # -------------------------------------

>>>
>>>
>>> import re
>>>
>>> input = '''
... The Up next we bring you a rebroadcast of.
... of. The Diane Rehm radio talk show.
... The Diane Rehm radio talk show. The program is heard over W.A.M. you
... The program is heard over W.A.M. you F.M. on the campus of the American
... F.M. on the campus of the American University in the nation's capital.
... University in the nation's capital. The special Martin Luther King Day show
... The special Martin Luther King Day show recorded Monday.
... recorded Monday. Focused on race relations.
... Focused on race relations. Ms Rames guests were Eleanor Holmes
... Ms Rames guests were Eleanor Holmes Norton.
... '''
>>>
>>> input_new = ""
>>> isOk = True
>>> while isOk :
...     input_new = re.sub( r"(?m)(?:((?:^(.*\S.*$)(?=[\S\s]*?^\2[ \t]*\S)\s*)*)(?=^(\2|(.*\S.*))$(?![\S\s]*?^\3[ \t]*\S)))(?:^(.+?)(.+?)$(?:\s*^\s*\6(.+?)$)+)?", r"\5\6\7", input)
...     if input_new != input:
...         input = input_new
...     else:
...         isOk = False
...
>>> print( "\r\n" + input + "\r\n" )

The Up next we bring you a rebroadcast of. The Diane Rehm radio talk show. The program is heard over W.A.M. you F.M. on the campus of the American University in the nation's capital. The special Martin Luther King Day show recorded Monday.
Focused on race relations. Ms Rames guests were Eleanor Holmes Norton.

>>>

Upvotes: 0

Bill Huang

Reputation: 4658

I re-implemented your steps into 3 blocks for more maintainability and clarity. The steps are explained as follows.

  1. Loader-discarder: As one can easily see, the previously loaded "substrings" are not useful at all. They can be discarded during the unavoidable and costly line-by-line loading inside the open() block. As a result, every line retained after this step is informative.
  2. Repeating-substring finder: A simple brute-force search for the length of the repeated substring. It takes 2 strings as arguments and returns 0 if no match is found.
  3. Repeating-substring remover: Applies the function from step 2 to your data. Separating steps 2 and 3 makes the responsibilities of the string-searching logic and the data-chopping logic clearer and more traceable.

The rest should be self-explanatory. Try to divide your logic into small, independently testable units, each serving a single logical purpose, instead of mixing them all together. I didn't reuse your code because I feel its logic is fairly complex and unlikely to be maintainable without a proper refactoring.

Code

def load_and_discard(file_path):
    """
    Load and discard previous substrings.

    Args:
        file_path (PathLike): path to data file

    Returns:
        list[str]
    """

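    data = []
    with open(file_path) as f:  # use the argument, not the global infile_path
        for line in f:
            st = line.strip()
            # `data and ...` avoids indexing an empty list when the file
            # starts with one or more blank lines
            if data and st.startswith(data[-1]):
                data[-1] = st
            elif len(st) > 0:  # guard against empty string
                data.append(st)
    return data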


def find_lebms(s1, s2):
    """
    Brute-force search for the longest-end-begin-matching-substring (LEBMS).

    Args:
        s1 (str): 1st stripped str (match the end)
        s2 (str): 2nd stripped str (match the begin)

    Returns:
        int: length of LEBMS, or 0 if there is no match
    """

    # search up to this length
    n1 = min(len(s1), len(s2))

    # try the longest candidate first so the first hit is the longest match
    for i in range(n1, 0, -1):
        if s1[-i:] == s2[:i]:
            return i
    return 0


def remove_repeated_substr(data):
    """
    Generate strings (in-place) ready for concatenation by
    removing the repeated substring in the first string.

    Args:
        data (list[str]): list of strings

    Returns:
        None
    """

    n0 = len(data)
    for i, st in enumerate(data):

        # guard: no chopping for the last line
        if i == n0 - 1:
            break

        # chop the current row
        n = find_lebms(st, data[i + 1])
        if n > 0:  # guard against n = 0
            data[i] = st[:-n]

infile_path = "/mnt/ramdisk/in.csv"

data = load_and_discard(infile_path)
remove_repeated_substr(data)

# (optional) prevent un-spaced ending periods
for i, st in enumerate(data):
    if st.endswith("."):  # also safe if a line was chopped down to ""
        data[i] += " "

ans = "".join(data)

Output:

from pprint import pprint

pprint(ans)

# The output text reads smoothly, at least for the sample data provided.
("that are at work there. And when their work has been done. We're going to "
 'evaluate both their risk their work. And the results of that work. And to '
 'make statements about it at that time. OK Thank you ladies and gentlemen. '
 'This press conference is over. Thank you. If')
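
The same end-to-begin overlap idea can also be tried directly on the lines quoted from the CURRENT OUTPUT in the question. What follows is only a minimal sketch: the names overlap_len and merge_overlapping and the min_overlap threshold (a guess meant to ignore accidental one- or two-character matches) are assumptions made for illustration, not part of the code above.

```python
def overlap_len(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_overlapping(lines, min_overlap=5):
    """Join consecutive lines whose end and beginning overlap.

    min_overlap is a hypothetical threshold; overlaps shorter than it
    are treated as coincidence and start a new output line.
    """
    merged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        k = overlap_len(merged[-1], line, min_overlap) if merged else 0
        if k:
            merged[-1] += line[k:]  # append only the part that is new
        else:
            merged.append(line)
    return merged


lines = [
    "Other focus of our emphasis today has to do with the matters that we",
    "to do with the matters that we presented in the letter this does not",
    "presented in the letter this does not mean that these are the only things",
    "mean that these are the only things that we are concerned about.",
    "But these are the matters that we want to put on the table.",
]

for sentence in merge_overlapping(lines):
    print(sentence)
```

On these five lines the first four collapse into a single sentence and the fifth stays separate. To process a file, the list could be replaced by the result of readlines().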

Upvotes: 2
