a-goonie

Reputation: 99

Remove duplicates in text file line by line

I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.

For example, the text file might contain:

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG

Thus, in the above example, the script should remove only the repeated strings, that is, the second occurrence of 10 ABC\ABCD\ABCDE or 12 EFG\EFG within a line.

I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.

Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.

Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.

Upvotes: 0

Views: 2147

Answers (2)

gregory

Reputation: 12885

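import re

# Collapse a whole item that is immediately repeated (adjacent duplicates only).
pattern = re.compile(r'(?<![^;])([^;\n]+)(?:;\1)+(?![^;\n])')

with open('file', 'r') as f:
    for line in f:
        # Each line keeps its own newline, so suppress print's extra one.
        print(pattern.sub(r'\1', line), end='')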

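Read the file line by line, then use re.sub to collapse an item that is immediately followed by a copy of itself. Duplicates that do not directly follow the first occurrence are left in place.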
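
To illustrate the case raised in Update 2, here is a small sketch (not part of the original answer) that applies an adjacent-duplicate pattern of this kind to shortened versions of the question's sample lines. The helper name `collapse_adjacent` and the shortened samples are assumptions made for illustration.

```python
import re

# Matches a whole item followed by one or more immediate repeats of itself.
ADJACENT = re.compile(r'(?<![^;])([^;\n]+)(?:;\1)+(?![^;\n])')


def collapse_adjacent(line):
    """Collapse items that are repeated directly after themselves."""
    return ADJACENT.sub(r'\1', line)


# Shortened sample lines (hypothetical, based on the question's data).
adjacent = r'þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ'
separated = r'þ;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ'

print(collapse_adjacent(adjacent))   # the repeated 12 EFG\EFG is collapsed
print(collapse_adjacent(separated))  # unchanged: the repeat is not adjacent
```

A regex that only looks at neighbouring items cannot cover the second case, which is why a per-line "seen" collection is probably the simpler route.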

Upvotes: 0

Sangbok Lee

Reputation: 2229

@Prune's answer gives the idea but it needs to be modified like this:

# Raw string so the backslashes in the sample data are kept literally.
input_file = r"""þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""

lines = input_file.split("\n")

for line in lines:
    seen_items = []
    for item in line.split(";"):
        # Keep the first occurrence of each item; always keep the þ field delimiter.
        if item not in seen_items or item == "þ":
            seen_items.append(item)
    print(";".join(seen_items))
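
The same idea can be sketched for a real file rather than an inline string. This is a minimal sketch, not the answerer's code: the function names `dedupe_line` and `dedupe_file`, the `utf-8` default encoding, and the choice to always keep empty items (so the field layout is not disturbed) are all assumptions.

```python
def dedupe_line(line, item_sep=";", keep="þ"):
    """Drop repeated items in one line, keeping first occurrences in order."""
    seen = set()
    kept = []
    for item in line.split(item_sep):
        # Assumption: the field delimiter and empty items are always kept.
        if item == keep or item == "" or item not in seen:
            kept.append(item)
            seen.add(item)
    return item_sep.join(kept)


def dedupe_file(src_path, dst_path, encoding="utf-8"):
    """Write a copy of src_path with each line de-duplicated independently."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(dedupe_line(line.rstrip("\n")) + "\n")


# Non-adjacent duplicate, as in Update 2 of the question.
sample = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ"
print(dedupe_line(sample))
```

Because `seen` is rebuilt for every line, a value repeated on a different line is untouched. If the file is not UTF-8, pass the actual encoding to `dedupe_file`.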

Upvotes: 1
