user6067640

Reputation:

How to remove duplicate lines and keep one copy of each in a text file

This code removes all of the duplicated values completely:

lines = open('D:\path\file.txt', 'r').readlines()
lines_set = set(lines)
out  = open('D:\path\file.txt', 'w')
for line in lines_set:
    out.write(line)

From this input:

3
3
2
7
7
7

I got only:

2

But how do I remove the duplicates and keep one copy of each value, so that I get this result:

3
2
7

Upvotes: 0

Views: 56

Answers (1)

Martijn Pieters

Reputation: 1123600

Your code works, provided the input file contains no extra whitespace and every line (including the last) ends in a newline. If you only see one line in the output, something else went wrong; perhaps you are looking at the output file before the Python script has exited, while the file is still open for writing. In that case the remaining lines are still sitting in the write buffer (used to speed up writes) and have not been flushed to the file yet.
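If that is what happened, the minimal fix is to close the output file when you are done writing, which flushes the buffer so every line reaches the disk. As a sketch of that idea applied to your original snippet (note the raw string, so that \f in the Windows path is not treated as an escape character):

lines = open(r'D:\path\file.txt', 'r').readlines()
lines_set = set(lines)

out = open(r'D:\path\file.txt', 'w')
for line in lines_set:
    out.write(line)
out.close()  # closing flushes the write buffer, so every line reaches the file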

However, for your code to work correctly in all circumstances, you need to ignore newlines and other whitespace when checking file contents:

with open(r'D:\path\file.txt', 'r') as lines:
    lines_set = {line.strip() for line in lines}

with open(r'D:\path\file.txt', 'w') as out:
    for line in lines_set:
        out.write(line + '\n')

The above code strips whitespace from each line before adding it to the set, and adds a newline back onto each line when writing. I also used the files as context managers (via the with statement), which ensures they are closed properly once reading or writing has completed.
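For example, running the question's sample lines through the same set comprehension (just the values, in memory) leaves the three unique values; note that a set has no defined order:

sample = ['3\n', '3\n', '2\n', '7\n', '7\n', '7\n']   # the lines as read from the sample file
print({line.strip() for line in sample})              # e.g. {'3', '2', '7'} (set order is arbitrary)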

Rather than read the whole input file into memory, you could write out each line as you first encounter it and only keep a set of the values seen so far. When streaming like this, the output has to go to a different file, because opening the input file in write mode would truncate it while it is still being read:

with open(r'D:\path\file.txt', 'r') as lines:
    seen = set()
    # write to a *different* file; the input file is still open for reading
    with open(r'D:\path\file_unique.txt', 'w') as out:
        for line in lines:
            line = line.strip()
            if line not in seen:
                out.write(line + '\n')
                seen.add(line)

This has the added advantage that the order of the unique lines is preserved. Memory use scales with the number of unique lines rather than the total number of lines; unless the number of unique values is huge (which would also mean a very large output file), you should have no trouble processing a large input file.
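As a quick in-memory illustration with the sample values from the question (no files involved), the seen-set approach keeps the first occurrence of each value and preserves their order:

seen = set()
unique_in_order = []
for value in ['3', '3', '2', '7', '7', '7']:   # the sample lines, already stripped
    if value not in seen:
        unique_in_order.append(value)
        seen.add(value)
print(unique_in_order)   # ['3', '2', '7']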

Upvotes: 1
