Reputation: 93
I have a complex task: removing consecutively repeated words or sentences. Below is an example input.
The
The Up
The Up next
The Up next we
The Up next we bring
The Up next we bring you
The Up next we bring you a
The Up next we bring you a rebroadcast
The Up next we bring you a rebroadcast of
The Up next we bring you a rebroadcast of.
of. The
of. The Diane
of. The Diane Rehm
of. The Diane Rehm radio
of. The Diane Rehm radio talk
of. The Diane Rehm radio talk show
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The
The Diane Rehm radio talk show. The program
The Diane Rehm radio talk show. The program is
The Diane Rehm radio talk show. The program is heard
The Diane Rehm radio talk show. The program is heard over
The Diane Rehm radio talk show. The program is heard over W.A.M.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M.
The program is heard over W.A.M. you F.M. on
The program is heard over W.A.M. you F.M. on the
The program is heard over W.A.M. you F.M. on the campus
The program is heard over W.A.M. you F.M. on the campus of
The program is heard over W.A.M. you F.M. on the campus of the
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University
F.M. on the campus of the American University in
F.M. on the campus of the American University in the
F.M. on the campus of the American University in the nation's
F.M. on the campus of the American University in the nation's capital
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The
University in the nation's capital. The special
University in the nation's capital. The special Martin
University in the nation's capital. The special Martin Luther
University in the nation's capital. The special Martin Luther King
University in the nation's capital. The special Martin Luther King Day
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded
The special Martin Luther King Day show recorded Monday
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused
recorded Monday. Focused on
recorded Monday. Focused on race
recorded Monday. Focused on race relations
recorded Monday. Focused on race relations.
Focused on race relations. Ms
Focused on race relations. Ms Rames
Focused on race relations. Ms Rames guests
Focused on race relations. Ms Rames guests were
Focused on race relations. Ms Rames guests were Eleanor
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton
Ms Rames guests were Eleanor Holmes Norton.
Current Output is below
The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused on race relations.
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton.
As you can see, even after this process, we still have repetition such as
The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
I simply want something like
The Up next we bring you a rebroadcast of.
The Diane Rehm radio talk show.
The program is heard over W.A.M. you F.M. on the campus of the American
University in the nation's capital. The special Martin Luther King Day show
recorded Monday. Focused on race relations.
...etc
How do I accomplish this task?
current code
import os


def load_and_discard(file_path):
    """
    Load and discard previous substrings.
    Args:
        file_path (PathLike): name of the data file inside ./input
    Returns:
        list[str]
    """
    data = []
    # use the function argument rather than the global infile_path
    with open("./input/" + file_path) as f:
        for line in f:
            st = line.strip()
            # "data and" guards against indexing an empty list
            if data and st.startswith(data[-1]):
                data[-1] = st
            elif len(st) > 0:  # guard against empty string
                data.append(st)
    return data


def find_lebms(s1, s2):
    """
    Linear scan for the longest-end-begin-matching-substring (LEBMS).
    Args:
        s1 (str): 1st stripped str (match the end)
        s2 (str): 2nd stripped str (match the begin)
    Returns:
        int: length of LEBMS
    """
    # search up to this length
    n1 = min(len(s1), len(s2))
    # try the longest candidate first so the result really is the longest
    for i in range(n1, 0, -1):
        if s1[-i:] == s2[:i]:
            return i
    return 0


def remove_repeated_substr(data):
    """
    Generate strings (in-place) ready for concatenation by
    removing the repeated substring in the first string.
    Args:
        data (list[str]): list of strings
    Returns:
        None
    """
    n0 = len(data)
    for i, st in enumerate(data):
        # guard: no chopping for the last line
        if i == n0 - 1:
            break
        # chop the current row
        n = find_lebms(st, data[i + 1])
        if n > 0:  # guard against n = 0
            data[i] = st[:-n]


directory = './input'
for filename in os.listdir(directory):
    data = load_and_discard(filename)
    remove_repeated_substr(data)
    # (optional) prevent un-spaced ending periods
    for i, st in enumerate(data):
        # endswith is safe even if a row was chopped down to an empty string
        if st.endswith("."):
            data[i] += " "
    ans = "\n".join(data)
    with open("./output/" + filename, "w") as text_file:
        text_file.write(ans)
You can use either the original input or my current output as your input, whichever is easier. With my output you don't have to process the incrementally repeated lines. When you post an answer, please let me know which one you used.
Alternative input
You can watch a representative.
Twenty three zero seven of the Rayburn Office Building.
Washington D.C. each week. C.-SPAN
Washington D.C. each week. C.-SPAN breaks
Washington D.C. each week. C.-SPAN breaks from
Washington D.C. each week. C.-SPAN breaks from its
Washington D.C. each week. C.-SPAN breaks from its public
Washington D.C. each week. C.-SPAN breaks from its public affairs
C.-SPAN breaks from its public affairs programming
C.-SPAN breaks from its public affairs programming to
C.-SPAN breaks from its public affairs programming to give
C.-SPAN breaks from its public affairs programming to give the
C.-SPAN breaks from its public affairs programming to give the viewer
C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
Six thirty P.M. Eastern three thirty P.M. Pacific Time.
Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN
P.M. Pacific Time. As always C.-SPAN scheduled
P.M. Pacific Time. As always C.-SPAN scheduled programming
As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
Going on this election year.
Covering every issue in the campaign calendar.
The calendar list the network's plans for campaign.
From now through election day.
In addition to election coverage.
Other major events are cameras record.
Call toll free one eight hundred three four six. Her it to order the C.-SPAN
four six. Her it to order the C.-SPAN update for
Her it to order the C.-SPAN update for twenty four dollars.
You can use your credit card or will be glad to send you a bill.
Call one eight hundred three four six eight hundred.
And you'll receive fifty issues of the C.-SPAN update.
If you order an update subscription now.
The receive a free gift. The C.-SPAN road to the White House
The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
Attractively depicts the spans grassroots approach to the campaign called.
Upvotes: 0
Views: 139
Reputation: 786146
You may be able to use this regex using a lookahead and back-reference to match overlapping duplicates and remove them.
(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+
Use an empty string for replacement.
Code:
import re
s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
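To see the lookahead and back-reference at work, here is a minimal sketch on a three-line sample taken from the question. The names `OVERLAP` and `drop_repeated_prefixes` are made up for this illustration, and the replacement is the empty string mentioned above. It has only been checked against this tiny sample, not the full input.

```python
import re

# The pattern from this answer, written with double quotes so the
# apostrophe inside the character class does not end the string literal.
OVERLAP = re.compile(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+")


def drop_repeated_prefixes(text):
    """Remove every chunk that is repeated right after spaces/dots (hypothetical helper)."""
    return OVERLAP.sub("", text)


sample = "The\nThe Up\nThe Up next"
print(drop_repeated_prefixes(sample))
```

Each shorter line is matched by group 1, found again by the lookahead at the start of the next line, and removed together with the newline that follows it.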
RegEx Details:
(: Start capture group #1
\b: Word boundary
[-\w\s.']+?: Lazily match 1+ word, space, hyphen, dot or ' characters
): End capture group #1
(?=[\s.]+\1): Positive lookahead to assert that group 1's captured value appears ahead of us after 1+ spaces/dots
[\s.]+: Match 1+ spaces or dots
To keep multiple lines you may use 2 replacements:
s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
s = re.sub(r'\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z', '', s)
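If the regex turns out to be hard to tune, a non-regex alternative is to compare whole words instead of characters. The sketch below is not the regex approach above but a different technique: it keeps a running list of words and, for each new line, skips the longest run of leading words that already ends the running list. The function names `overlap_len` and `merge_overlapping` are hypothetical, and it assumes the raw input (not the partially processed output) and that losing the original line breaks is acceptable.

```python
def overlap_len(prev_words, cur_words):
    """Length of the longest suffix of prev_words that is also a prefix of cur_words."""
    limit = min(len(prev_words), len(cur_words))
    for k in range(limit, 0, -1):  # longest candidate first
        if prev_words[-k:] == cur_words[:k]:
            return k
    return 0


def merge_overlapping(lines):
    """Join lines into one string, dropping words repeated from the text so far."""
    merged = []
    for line in lines:
        words = line.split()
        if not words:  # skip blank lines
            continue
        k = overlap_len(merged, words)
        merged.extend(words[k:])  # keep only the words that are new
    return " ".join(merged)


lines = ["The", "The Up", "The Up next", "Up next we", "next we bring."]
print(merge_overlapping(lines))
```

One caveat: a word that is genuinely spoken twice across a line boundary would also be collapsed, so this is only a starting point.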
Upvotes: 1