Reputation: 1395
I am working with a very large text file (500MB+). My code produces the right output, but the output contains a lot of duplicates. I would like to check whether an entry already exists in the output before writing it to the file. I am sure it is just one line in an if statement, but I do not know Python well and cannot figure out the syntax. Any help would be greatly appreciated.
Here is the code:
authorList = ['Shakes.','Scott']
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                output_file.write(the_whole_file[start_position:end_position+4])
                output_file.write("\n")
                start_position = end_position + 4
Upvotes: 0
Views: 530
Reputation: 27585
I think you should process your file with a tool designed for text processing: regular expressions.
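import re

# Group 1 captures the author name; the lazy quantifiers stop at the first </W>.
regx = re.compile('<A>(.+?)</A>.*?<W>.*?</W>')

# Text mode, so the str pattern and the '\n' written below match the data type.
with open('/Users/Desktop/2e.txt','r') as open_file,\
     open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    remain = ''
    seen = set()
    while True:
        chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16
        if not chunk:
            break
        text = remain + chunk
        last_end = 0  # stays 0 when this chunk contains no complete match
        for mat in regx.finditer(text):
            # Only the first entry of each author is written.
            if mat.group(1) not in seen:
                output_file.write(mat.group() + '\n')
                seen.add(mat.group(1))
            last_end = mat.end()
        # Keep the unmatched tail so an entry split across two chunks is still found.
        remain = text[last_end:]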
Upvotes: 0
Reputation: 56961
My understanding is that you want to skip the lines in open_file that contain the name of one of your authors when writing to output_file. If that is what you intend, do it this way.
authorList = ['Shakes.','Scott']
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        for line in open_file:
            skip = 0
            for author in authorList:
                if author in line:
                    skip = 1
            if not skip:
                output_file.write(line)
Upvotes: 0
Reputation: 16900
Create a list holding every string to write. Before appending an item, check whether it is already in the list.
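A minimal sketch of that idea, assuming the entries have already been extracted as strings. The function name `collect_unique` and the sample entries are made up for illustration and are not from the question.

```python
def collect_unique(entries):
    """Return the entries in first-seen order, skipping repeats."""
    to_write = []
    for entry in entries:
        # Only append an entry the list does not already hold.
        if entry not in to_write:
            to_write.append(entry)
    return to_write


# Hypothetical sample data: the first two entries are identical.
sample = [
    '<A>Scott</A><W>one</W>',
    '<A>Scott</A><W>one</W>',
    '<A>Shakes.</A><W>two</W>',
]
unique_entries = collect_unique(sample)
```

Note that `in` on a list scans the whole list, so this gets slow when there are many distinct entries; a set or dict lookup scales better for a file of this size.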
Upvotes: 0
Reputation: 76765
I suggest that you simply keep track of which author data you have already seen, and only write it if you haven't seen it before. You can use a dict to keep track.
authorList = ['Shakes.','Scott']
already_seen = {} # dict to keep track of what has been seen
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                author_data = the_whole_file[start_position:end_position+4]
                if author_data not in already_seen:
                    output_file.write(author_data + "\n")
                    already_seen[author_data] = True
                start_position = end_position + 4
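The same approach can also be sketched with a set, which is enough when you only need membership tests, and wrapped in a function so it can be tried on a small string before running it on the full file. The function name `extract_unique_entries` and the sample text are hypothetical; the early exit on a missing `</W>` is an addition the code above does not have.

```python
def extract_unique_entries(text, authors):
    """Collect each distinct <A>author</A>...</W> entry once, in first-seen order."""
    seen = set()
    entries = []
    for author in authors:
        marker = '<A>' + author + '</A>'
        start = 0
        while True:
            start = text.find(marker, start)
            if start < 0:
                break
            end = text.find('</W>', start)
            if end < 0:
                break  # no closing tag left: stop instead of slicing a bad range
            entry = text[start:end + 4]
            if entry not in seen:
                seen.add(entry)
                entries.append(entry)
            start = end + 4
    return entries


# Hypothetical sample: the Scott entry appears twice.
sample = ('<A>Scott</A><W>Lochinvar</W>'
          '<A>Shakes.</A><W>Sonnet</W>'
          '<A>Scott</A><W>Lochinvar</W>')
result = extract_unique_entries(sample, ['Shakes.', 'Scott'])
```

Entries come back grouped by author in the order of the author list, just as the loop above writes them.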
Upvotes: 1