Reputation: 643
I have the following code:
import fileinput, os, glob, re
# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
f = open(filename, 'r')
# Create output csv write total found occurrences of search string after name of search string
with open(filename[:-4] + 'output.csv','w') as output:
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")
# close and finish
f.close
output.close
It successfully finds the first string and writes the total count to the output file, but it writes zero matches for 'String 1 reverse', even though it should find thousands.
It works if I insert this between searching for String 1 and String 1 reverse:
f.close
f = open(filename, 'r')
i.e. I close the read file and then open it again.
I don't want to have to add this after each search line. What's going on? Is it something to do with caching of the open file, or a cache in the regex module?
Thanks
Upvotes: 0
Views: 598
Reputation: 4972
I do not have samples to test your example, but I suspect that the issue comes from:
output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")
You are calling f.read() twice. The first call reads the entire file and leaves the cursor at the end of the file, so the second f.read() returns an empty string because there is no more data to read.
Remember that reading from a file moves the read cursor (the position attached to the file object): after reading n bytes, the position advances by n bytes. With no arguments, f.read() reads everything up to the end of the file and leaves the cursor there.
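To see the cursor behaviour in isolation, here is a minimal sketch. It writes a throwaway temporary file (the file contents and variable names are made up for illustration) and reads it twice through the same file object:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("s5 w249 w1025\ns5 w1026 w301\n")
    path = tmp.name

with open(path, "r") as f:
    first = f.read()    # reads everything; the cursor is now at end of file
    second = f.read()   # nothing left to read, so this is an empty string

os.remove(path)

print(repr(first))
print(repr(second))
```

The second call is not an error, which is why the original script silently reported zero matches instead of failing.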
You have two solutions:
Store the file content in a variable (e.g. content = f.read()) and perform your searches on that variable.
Use the file seek features:
To change the file object’s position, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.
https://docs.python.org/3/tutorial/inputoutput.html
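As a rough illustration of the second option, the sketch below rewinds the file with f.seek(0) between the two searches. The sample data is invented so that the snippet runs on its own; the two patterns are the ones from the question:

```python
import os
import re
import tempfile

# Invented sample data: one "forward" line and two "reverse" lines.
sample = (
    "s5 a w249 w1025 w301 w1026 b\n"
    "s5 a w1026 w301 w1025 w249 b\n"
    "s5 c w1026 w301 w1025 w249 d\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, "r") as f:
    forward = len(re.findall(r's5 .*w249 w1025 w301 w1026 .*', f.read()))
    f.seek(0)  # move the cursor back to the start before reading again
    reverse = len(re.findall(r's5 .*w1026 w301 w1025 w249 .*', f.read()))

os.remove(path)

print(forward, reverse)
```

This works, but it reads the file from disk once per search, which is why keeping the contents in a variable is usually simpler.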
The first solution is actually recommended: you don't need to read the file more than once, and seeking features are mostly used for large file operations.
Here is a fixed version of your code following that recommendation:
import fileinput, os, glob, re
# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
content = open(filename, 'r').read()
# Create output csv write total found occurrences of search string after name of search string
with open(filename[:-4] + 'output.csv','w') as output:
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',content)))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',content)))) +"\n")
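Note that there is no explicit close() call on the input file now, because no reference to the file object is kept. CPython closes it once the object is garbage-collected, although wrapping the read in a with block is the more reliable way to guarantee that the file is closed.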
Upvotes: 0
Reputation: 174632
Once you do a file.read(), the entire file is read and the pointer is at the end of the file, which is why the second line doesn't return any results.
You need to read the contents first, then run your analysis:
print("found " + filename + ", opening...")
f = open(filename, 'r')
contents = f.read()
f.close() # -- note f.close() not f.close
results_a = re.findall(r's5 .*w249 w1025 w301 w1026 .*',contents)
results_b = re.findall(r's5 .*w1026 w301 w1025 w249 .*',contents)
with open(filename[:-4] + 'output.csv','w') as output:
    output.write("------------Group 1----------\n")
    output.write("String 1, {}\n".format(len(results_a)))
    output.write("String 1 reverse, {}\n".format(len(results_b)))
You don't need output.close (it didn't do anything in the first place, because without parentheses the method is never called), as the with statement automatically closes the file.
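The difference between f.close and f.close() can be checked with the file object's closed attribute. The following is a small sketch using an empty temporary file; the variable names are only illustrative:

```python
import os
import tempfile

# Create an empty temporary file to open.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

f = open(path, "r")
f.close                        # only looks up the method; nothing is called
still_open = not f.closed      # the file is still open at this point
f.close()                      # the parentheses actually call the method
closed_after_call = f.closed

with open(path, "r") as g:
    pass
closed_by_with = g.closed      # the with block closed the file on exit

os.remove(path)

print(still_open, closed_after_call, closed_by_with)
```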
If you want to repeat this operation for all the files that match your pattern:
import glob
import re
import os
BASE_DIR = '/full/path/to/file/directory'
for file in glob.iglob(os.path.join(BASE_DIR, '*.txt')):
    # Read the whole file once, then run every search on the same string
    with open(file) as f:
        contents = f.read()
    # basename() needs the path string, not the file object
    filename = os.path.splitext(os.path.basename(file))[0]
    results_a = re.findall(r's5 .*w249 w1025 w301 w1026 .*',contents)
    results_b = re.findall(r's5 .*w1026 w301 w1025 w249 .*',contents)
    with open(os.path.join(BASE_DIR, '{}output.csv'.format(filename)), 'w') as output:
        output.write("------------Group 1----------\n")
        output.write("String 1, {}\n".format(len(results_a)))
        output.write("String 1 reverse, {}\n".format(len(results_b)))
Upvotes: 1