Reputation: 39
I want to remove near-duplicate lines and keep only the longest one of each group. My idea is to first compare the first word, or the first few words, to pick the candidates for comparison, and then compare the lengths of those candidates. If a line is the longest, I write it to a new text file. Here is the test file: https://drive.google.com/file/d/1tdewlNtIqBMaldgrUr02kbCKDyndXbSQ/view?usp=sharing
Input:
I am Harry.
I am Harry. I like
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.
Expected output:
I am Harry. I like to eat apple.
I am Garry. I am Happy.
I am doing it with Python, but the thing just won't work.
f1 = open('a.txt','r') # Read from file
ListofLine = f1.readlines() # Read the line into list
f2 = open('n.txt','w') # Open new file to write
# Iterate all the sentences to compare
for x in len(ListofLine):
    # Comparing first word of the sentences
    if(ListofLine[x].split()[0] = ListofLine[x+1].split()[0]):
        # Comparing the length and keep the longest length sentences
        if(len(ListofLine[x])>len(ListofLine[x+1])):
            f2.write(ListofLine[x])
f1.close()
f2.close()
Upvotes: 2
Views: 77
Reputation: 11602
If you can define a function that maps each line to a distinct class, you could use itertools.groupby.
For example, suppose that two strings are similar if they have the same 10 starting chars.
data = """I am Harry.
I am Harry. I like
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.""".split('\n')
from itertools import groupby
criterion = lambda s: s[:10]
result = [max(g[1], key=len) for g in groupby(data, criterion)]
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
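Note that groupby only merges consecutive items with equal keys, so the snippet above relies on similar lines being adjacent, as they are in the sample. Here is a minimal sketch, assuming the same 10-character criterion, that sorts by the key first so that non-adjacent lines end up in the same group. The name `longest_per_prefix` and the shuffled sample are made up for illustration, and sorting does not preserve the original line order.

```python
from itertools import groupby

def longest_per_prefix(lines, width=10):
    """Return the longest line for each distinct prefix of `width` chars."""
    def key(s):
        return s[:width]
    # groupby only merges adjacent equal keys, so sort by the key first
    ordered = sorted(lines, key=key)
    return [max(group, key=len) for _, group in groupby(ordered, key)]

# Hypothetical input where similar lines are not adjacent
shuffled = [
    "I am Harry.",
    "I am Garry. I am Happy.",
    "I am Harry. I like to eat apple.",
    "I am Garry.",
]
result = longest_per_prefix(shuffled)
print(result)
# ['I am Garry. I am Happy.', 'I am Harry. I like to eat apple.']
```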
Upvotes: 0
Reputation: 13964
With least effort:
The trick is to skip computing line lengths altogether and use startswith() to check whether a line that was already kept is a prefix of the new one. If it is, the new line is at least as long as the kept one (even if only by one character), which is all that matters.
ListofLine = ["I am Harry.",
              "I am Harry. I like to eat apple.",
              "I am Garry.",
              "I am Garry. I am Happy."]
longest = []  # to contain the longest ones (avoid naming it `list`, which shadows the built-in)
for line in ListofLine:  # ListofLine are basically the input lines
    found = False
    for k in longest:
        if line.startswith(k):
            longest.remove(k)     # removes relatively smaller one
            longest.append(line)  # add the longer one instead
            found = True
            break                 # stop iterating right after modifying the list
    if not found:
        longest.append(line)
for item in longest:
    print(item)
In the end, the list contains only the longest items.
https://www.jdoodle.com/embed/v0/vIG
prints:
I am Harry. I like to eat apple.
I am Garry. I am Happy.
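The loop above assumes that a shorter line always comes before its longer version. If the order can vary, a possible variation is to also skip a new line when a kept line already starts with it. This is only a sketch: the function name `keep_longest` is hypothetical, and it assumes the lines have already been stripped of trailing newlines (with readlines(), "I am Harry.\n" is not a prefix of a longer line).

```python
def keep_longest(lines):
    """Keep only the lines that are not a prefix of another kept line."""
    kept = []
    for line in lines:
        # Skip the line if a kept line already extends (or equals) it
        if any(k.startswith(line) for k in kept):
            continue
        # Drop kept lines that the new line extends, then keep the new line
        kept = [k for k in kept if not line.startswith(k)]
        kept.append(line)
    return kept

# Hypothetical input where a longer line comes before its shorter prefix
lines = [
    "I am Harry. I like to eat apple.",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy.",
]
result = keep_longest(lines)
for item in result:
    print(item)
```

Each line is compared with every kept line, so this is quadratic in the worst case, which should be fine for small files.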
Upvotes: 0
Reputation: 22992
You need to define a criterion to identify what you call the common part. It can be the first sentence, for instance "I am Harry."
To parse a sentence, you can use a RegEx, for instance:
import re
# match a sentence finishing by a dot
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall
Here find_all_sentences is a function: it is the findall method of the pattern object returned by re.compile. It’s a helper to find all sentences in a line.
Once this function is defined, you can use it to parse the lines and extract the first sentence, which is treated as the common part to check.
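As a quick check of what the helper returns, here is a small sketch that repeats the definition above and applies it to one of the sample lines:

```python
import re

# Same pattern as above: a run of characters up to an optional dot
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

sentences = find_all_sentences("I am Harry. I like to eat apple.")
print(sentences)     # ['I am Harry.', 'I like to eat apple.']
print(sentences[0])  # I am Harry.
```

An unfinished sentence without a final dot, such as "I am Hap", is still returned as the last item.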
Any time you match a sentence, you can store it in a dict (here I used an OrderedDict to keep the order of the lines). Of course, if you find a longer line, you can replace the existing line with it:
import collections
lines = [
    "I am Harry. I like to eat apple",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy."]
longuest = collections.OrderedDict()
for line in lines:
    sentences = find_all_sentences(line)
    first = sentences[0]
    if first in longuest:
        longuest[first] = max([longuest[first], line], key=len)
    else:
        longuest[first] = line
Finally you can serialize the result to a file. Or print it:
for line in longuest.values():
    print(line)
To write a file, use a with statement:
import io
out_path = 'path/to/sentences.txt'
with io.open(out_path, mode='w', encoding='utf-8') as f:
    for line in longuest.values():
        print(line, file=f)
Upvotes: 1