Harry Hau

Reputation: 39

Removing near-duplicate lines that differ only in length, keeping the longest

I want to remove near-duplicate lines and keep only the longest one of each group. My idea is to first compare the first word (or first few words) to pick out the candidates for comparison, then compare the lengths of those candidates. If a line is the longest, I write it to a new text file. Here is the test file: https://drive.google.com/file/d/1tdewlNtIqBMaldgrUr02kbCKDyndXbSQ/view?usp=sharing

Input

I am Harry.
I am Harry. I like 
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.

Output

I am Harry. I like to eat apple.
I am Garry. I am Happy.

I am doing it with Python, but the thing just won't work.

Code

f1 = open('a.txt','r') # Read from file
ListofLine = f1.readlines() # Read the line into list
f2 = open('n.txt','w') # Open new file to write

# Iterate all the sentences to compare
for x in len(ListofLine):
    # Comparing first word of the sentences
    if(ListofLine[x].split()[0] = ListofLine[x+1].split()[0]):
        # Comparing the length and keep the longest length sentences
        if(len(ListofLine[x])>len(ListofLine[x+1])):
            f2.write(ListofLine[x])

f1.close()   
f2.close()

Upvotes: 2

Views: 77

Answers (3)

hilberts_drinking_problem

Reputation: 11602

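If you can define a function that maps each line to the class it belongs to, you can use itertools.groupby. Note that groupby only groups consecutive items, so similar lines must be adjacent, as they are in your input.

For example, suppose that two strings are similar if they share the same first 10 characters.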

data = """I am Harry.
I am Harry. I like
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.""".split('\n')

from itertools import groupby
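# Two lines belong to the same group if their first 10 characters match
criterion = lambda s: s[:10]

# groupby yields (key, group) pairs; keep the longest line of each group
result = [max(group, key=len) for _, group in groupby(data, criterion)]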
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
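
One caveat worth spelling out: groupby only merges consecutive items with equal keys. If similar lines might not be adjacent in your file, a minimal sketch is to sort by the same key first. The names keep_longest and prefix_len are made up for illustration, and sorting changes the order of the output.

```python
from itertools import groupby


def keep_longest(lines, prefix_len=10):
    """Keep the longest line per group; keep_longest/prefix_len are illustrative names."""
    key = lambda s: s[:prefix_len]
    # sorted() is stable and makes lines with equal keys adjacent,
    # which groupby needs in order to put them in one group
    ordered = sorted(lines, key=key)
    return [max(group, key=len) for _, group in groupby(ordered, key)]


shuffled = ["I am Harry.",
            "I am Garry.",
            "I am Harry. I like to eat apple.",
            "I am Garry. I am Happy."]
print(keep_longest(shuffled))
# ['I am Garry. I am Happy.', 'I am Harry. I like to eat apple.']
```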

Upvotes: 0

Saurav Sahu

Reputation: 13964

With the least effort:

The trick is to use startswith() to check whether a new line extends a line that was already kept, instead of comparing full lengths. As soon as a line turns out to be longer than a kept line with the same prefix, it replaces that line, which is all that matters.

ListofLine = ["I am Harry.",
              "I am Harry. I like to eat apple.",
              "I am Garry.",
              "I am Garry. I am Happy."]
longest = []   # will contain the longest lines (not named `list`, which would shadow the built-in)

for line in ListofLine:  # ListofLine holds the input lines
    found = False
    for k in longest:
        if line.startswith(k):
            longest.remove(k)     # drop the shorter line
            longest.append(line)  # keep the longer one instead
            found = True
            break
    if not found:
        longest.append(line)
for item in longest:
    print(item)

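In the end, the list contains only the longest lines. Note that this relies on each shorter line appearing before the longer line that extends it, as in the input above.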

https://www.jdoodle.com/embed/v0/vIG

prints:

I am Harry. I like to eat apple.
I am Garry. I am Happy.
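
If the longer line can come before its shorter prefix, the loop above would keep both. A minimal sketch that checks the prefix relation in both directions could look like this (longest_by_prefix is a hypothetical helper name, and this is only one way to handle that case):

```python
def longest_by_prefix(lines):
    """Keep one line per prefix family, regardless of input order (illustrative helper)."""
    kept = []
    for line in lines:
        for i, existing in enumerate(kept):
            if line.startswith(existing):
                kept[i] = line  # line extends a kept line: replace it in place
                break
            if existing.startswith(line):
                break  # line is a shorter prefix of a kept line: skip it
        else:
            kept.append(line)  # no prefix relation with anything kept so far
    return kept


lines = ["I am Harry. I like to eat apple.",
         "I am Harry.",
         "I am Garry.",
         "I am Garry. I am Happy."]
print(longest_by_prefix(lines))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```

Replacing in place also keeps each group at the position where it was first seen.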

Upvotes: 0

Laurent LAPORTE

Reputation: 22992

You need to define a criterion to identify what you call the common part. It can be the first sentence, for instance "I am Harry."

To parse a sentence, you can use a RegEx, for instance:

import re


# match a sentence ending with a dot
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

Here find_all_sentences is a function: it is the bound findall method of the compiled pattern. It is a helper that returns all the sentences in a line.
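
To see what the helper returns, here is a small check on one of the sample lines. The first_sentence wrapper is a hypothetical addition for illustration; it only guards against lines where the pattern finds nothing, such as an empty line.

```python
import re

re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall


def first_sentence(line):
    """Return the first sentence of line, or None if nothing matches (illustrative helper)."""
    sentences = find_all_sentences(line)
    return sentences[0] if sentences else None


print(find_all_sentences("I am Harry. I like to eat apple."))
# ['I am Harry.', 'I like to eat apple.']
print(first_sentence(""))
# None
```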

Once this function is defined, you can use it to parse the lines and extract the first sentence, which is treated as the common part to compare.

Each time you match a first sentence, you can store the line in a dict keyed by that sentence (here I used an OrderedDict to keep the order of the lines). If you later find a longer line with the same first sentence, replace the stored line with it:

import collections

lines = [
    "I am Harry. I like to eat apple",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy."]

longuest = collections.OrderedDict()
for line in lines:
    sentences = find_all_sentences(line)
    first = sentences[0]
    if first in longuest:
        longuest[first] = max([longuest[first], line], key=len)
    else:
        longuest[first] = line

Finally you can serialize the result to a file. Or print it:

for line in longuest.values():
    print(line)

To write to a file, use a with statement:

import io


out_path = 'path/to/sentences.txt'

with io.open(out_path, mode='w', encoding='utf-8') as f:
    for line in longuest.values():
        print(line, file=f)
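
Putting the pieces together, a rough end-to-end sketch could read the input file, keep the longest line per first sentence, and write the result. The function name dedupe_file and the temporary file names are assumptions for the demo. It uses a plain dict, which preserves insertion order on Python 3.7+, and it skips lines where the pattern finds no sentence.

```python
import os
import re
import tempfile

find_all_sentences = re.compile(r'((?:(?!\.|$).)+\.?)\s*', flags=re.DOTALL).findall


def dedupe_file(in_path, out_path):
    """Write the longest line per first sentence to out_path (illustrative helper)."""
    longest = {}  # first sentence -> longest line seen so far
    with open(in_path, encoding='utf-8') as src:
        for raw in src:
            line = raw.strip()
            sentences = find_all_sentences(line)
            if not sentences:  # blank line or nothing sentence-like
                continue
            first = sentences[0]
            if len(line) > len(longest.get(first, '')):
                longest[first] = line
    with open(out_path, 'w', encoding='utf-8') as dst:
        for line in longest.values():
            dst.write(line + '\n')
    return list(longest.values())


# Demo on the sample input from the question, using a temporary directory
sample = ("I am Harry.\nI am Harry. I like \nI am Harry. I like to eat apple.\n"
          "I am Garry.\nI am Garry. I am Hap\nI am Garry. I am Happy.\n")
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'a.txt')
    dst_path = os.path.join(tmp, 'n.txt')
    with open(src_path, 'w', encoding='utf-8') as f:
        f.write(sample)
    result = dedupe_file(src_path, dst_path)
    with open(dst_path, encoding='utf-8') as f:
        written = f.read()

print(result)
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```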

Upvotes: 1
