How do I remove duplicate entries in my output file in Python?

Question

I'm very new to Python. I am trying to extract data from a text file in the format:

85729 block addressing index approximate text retrieval

85730 automatic query expansion based divergence etc...

The output text file should be a list of the words with no duplicate entries. The input file can contain duplicates. The output should look like this:

block

addressing

index

approximate

etc....

With my code so far, I am able to get the list of words but the duplicates are included. I try to check for duplicates before I enter a word into the output file but the output does not reflect that. Any suggestions? My code:

infile = open("paper.txt", 'r')
outfile = open("vocab.txt", 'r+a')
lines = infile.readlines()
for i in lines:
   thisline = i.split()
   for word in thisline:
       digit = word.isdigit()
       found = False
       for line in outfile:
            if word in line:
                found = True
                break  
       if (digit == False) and (found == False ):   
                    outfile.write(word);
                    outfile.write("\n");

I don't understand how for loops are closed in Python. In C++ or Java, curly braces define the body of a for loop, but I'm not sure how it's done in Python. Can anyone help?

dstromberg · Accepted Answer

Python loops are closed by dedenting; the whitespace on the left has semantic meaning. This saves you from furiously typing curly braces or do/od or whatever, and eliminates a class of errors where your indentation accidentally doesn't reflect your control flow accurately.
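To make the dedenting rule concrete, here is a minimal sketch. The function name `collect_words` and the sample lines are made up for illustration; the comments mark where each loop body ends.

```python
def collect_words(lines):
    """Return the non-numeric words from lines, in order of appearance."""
    words = []
    for line in lines:                  # outer loop body starts on the next line
        for word in line.split():       # inner loop body is indented one level further
            if not word.isdigit():
                words.append(word)      # belongs to the if, which belongs to the inner loop
        # back at the outer loop's indentation: the inner loop has ended
    # back at the function's indentation: the outer loop has ended
    return words


sample = ["85729 block addressing index", "85730 automatic query expansion"]
result = collect_words(sample)
```

Whatever is indented under a `for` line is the loop body; the first line that returns to the `for` line's own indentation (or less) is outside the loop.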

Your input doesn't appear to be large enough to justify a loop over your output file (and if it did I'd probably use a gdbm table anyway), so you can probably do something like this (tested very briefly):

#!/usr/local/cpython-3.3/bin/python

with open('/etc/crontab', 'r') as infile, open('output.txt', 'w') as outfile:
    seen = set()
    for line in infile:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                outfile.write('{}\n'.format(word))
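The same idea adapted to the question might look roughly like the sketch below. It assumes you also want to skip the purely numeric IDs, as the `isdigit()` check in the question does. The helper names `unique_words` and `write_vocab` are hypothetical, not from any library. Keeping a `set` in memory also sidesteps re-reading the output file for every word: a file object is exhausted after one pass, so the inner `for line in outfile` loop in the question cannot find words written earlier.

```python
def unique_words(lines):
    """Yield each non-numeric word the first time it appears."""
    seen = set()
    for line in lines:
        for word in line.split():
            # Skip numeric IDs and words that were already emitted.
            if word.isdigit() or word in seen:
                continue
            seen.add(word)
            yield word


def write_vocab(in_path, out_path):
    """Write the unique words from in_path to out_path, one per line."""
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for word in unique_words(infile):
            outfile.write('{}\n'.format(word))
```

With the file names from the question, the call would be `write_vocab("paper.txt", "vocab.txt")`. Note that `word in seen` is an exact match on the whole word, whereas `word in line` in the question is a substring test on a line of text.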
