Reputation: 21
I'm very new to Python. I am trying to extract data from a text file in the format:
85729 block addressing index approximate text retrieval
85730 automatic query expansion based divergence etc...
The output text file should list each word once, with no duplicate entries, even though the input file can contain duplicates. The output should look like this:
block
addressing
index
approximate
etc....
With my code so far, I am able to get the list of words, but the duplicates are included. I try to check for duplicates before writing a word to the output file, but the output does not reflect that. Any suggestions? My code:
infile = open("paper.txt", 'r')
outfile = open("vocab.txt", 'r+a')
lines = infile.readlines()
for i in lines:
    thisline = i.split()
    for word in thisline:
        digit = word.isdigit()
        found = False
        for line in outfile:
            if word in line:
                found = True
                break
        if (digit == False) and (found == False ):
            outfile.write(word);
            outfile.write("\n");
I don't understand how for loops are closed in Python. In C++ or Java, curly braces define the body of a for loop, but I'm not sure how it's done in Python. Can anyone help?
Upvotes: 1
Views: 123
Reputation: 7177
Python loops are closed by dedenting; the whitespace on the left has semantic meaning. This saves you from furiously typing curly braces or do/od or whatever, and eliminates a class of errors where your indentation accidentally doesn't reflect your control flow accurately.
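To make the dedenting rule concrete, here is a minimal sketch. The function name `collect_words` and the sample lines are made up for illustration; the comments mark where each block ends simply because the next line sits further to the left.

```python
def collect_words(lines):
    """Return the non-numeric words in lines, in order (duplicates kept)."""
    words = []
    for line in lines:                 # outer loop body: one level in
        for word in line.split():      # inner loop body: two levels in
            if not word.isdigit():     # if body: three levels in
                words.append(word)
            # back at the inner-loop level: the `if` has ended
        # back at the outer-loop level: the inner `for` has ended
    return words                       # back at function level: the outer `for` has ended


sample = ["85729 block addressing", "85730 block index"]
print(collect_words(sample))  # ['block', 'addressing', 'block', 'index']
```

If the `return` line were indented one level further, it would become part of the outer loop and the function would return after the first line, which is the kind of mistake to watch for when coming from brace-delimited languages.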
Your input doesn't appear to be large enough to justify a loop over your output file (and if it did I'd probably use a gdbm table anyway), so you can probably do something like this (tested very briefly):
#!/usr/local/cpython-3.3/bin/python
with open('/etc/crontab', 'r') as infile, open('output.txt', 'w') as outfile:
    seen = set()
    for line in infile:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                outfile.write('{}\n'.format(word))
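The snippet above keeps every token, including the leading numbers. A possible adaptation to the question's requirements might look like the following sketch, which also skips purely numeric tokens with `str.isdigit()`. The helper names `unique_words` and `write_vocab` are hypothetical, and the file names are the ones from the question.

```python
def unique_words(lines):
    """Return each non-numeric word once, in order of first appearance."""
    seen = set()     # words already emitted; set membership tests are fast
    ordered = []     # keeps first-appearance order for the output
    for line in lines:
        for word in line.split():
            if word.isdigit():     # skip the leading record IDs such as 85729
                continue
            if word not in seen:
                seen.add(word)
                ordered.append(word)
    return ordered


def write_vocab(in_path, out_path):
    """Write the unique non-numeric words of in_path to out_path, one per line."""
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for word in unique_words(infile):
            outfile.write('{}\n'.format(word))
```

With the question's files, this would be called as `write_vocab('paper.txt', 'vocab.txt')`. Note that mode `'w'` overwrites `vocab.txt` on each run, which is an assumption about the desired behavior; the duplicate check lives in memory rather than in a re-read of the output file, so nothing needs to be read back from disk.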
Upvotes: 1