Close read file and open again required in order to write search result string to output file

Question

I have the following code:

import fileinput, os, glob, re

# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
f = open(filename, 'r')

# Create output csv write total found occurrences of search string after name of search string 
with open(filename[:-4] + 'output.csv','w') as output:    
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")

# close and finish
f.close
output.close

It successfully finds the first string and writes the total count to the output file, but it writes zero matches for 'String 1 reverse', even though it should find thousands.

It works if I insert this between searching for String 1 and String 1 reverse:

f.close
f = open(filename, 'r')

i.e. I close the read file and then open it again.

I don't want to have to add this after each search line, what's going on? Is it something to do with caching the open file or cache in regex?

Thanks

Fabien · Accepted Answer

I do not have samples to test your example, but I suspect that the issue comes from:

 output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
 output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")

You are calling f.read() twice. The first call reads the entire file and leaves the cursor at the end of the file, so the second f.read() returns an empty string: there is no more data to read.

Remember that reading a file moves the reading cursor (the position attached to the file descriptor): after reading n bytes, the cursor has advanced by n bytes. Called with no arguments, f.read() reads everything from the current position to the end of the file and leaves the cursor at end of file.
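The behavior can be reproduced in isolation. The following is a minimal sketch, assuming a small throwaway file with made-up contents (the file and the names first and second are illustrative, not part of the original code):

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("s5 a w249\ns5 b w249\n")
    path = tmp.name

with open(path, "r") as f:
    first = f.read()    # reads everything; the cursor is now at end of file
    second = f.read()   # nothing left to read from the current position

os.remove(path)

print(repr(first))   # the whole file content
print(repr(second))  # an empty string
```

Any search run against second has nothing to match, which is consistent with the zero count reported for 'String 1 reverse'.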

You have two solutions:

Store the file content in a variable (e.g. content = f.read()) and perform your searches on that variable.
Use the file seek features:

To change the file object’s position, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.

https://docs.python.org/3/tutorial/inputoutput.html
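As a rough illustration of the seek approach, the sketch below rewinds the file object with f.seek(0) between the two searches. It assumes an in-memory io.StringIO object as a stand-in for the opened text file, and the sample lines are made up; the two regular expressions are the ones from the question.

```python
import io
import re

# In-memory stand-in for the opened text file (sample data is made up).
f = io.StringIO(
    "s5 x w249 w1025 w301 w1026 y\n"
    "s5 x w1026 w301 w1025 w249 y\n"
)

forward = len(re.findall(r's5 .*w249 w1025 w301 w1026 .*', f.read()))

# Rewind to the beginning of the file before reading it a second time.
f.seek(0)
reverse = len(re.findall(r's5 .*w1026 w301 w1025 w249 .*', f.read()))

# Without another seek(0), the cursor is at the end again and read() is empty.
without_seek = len(re.findall(r's5 .*w1026 w301 w1025 w249 .*', f.read()))

print(forward, reverse, without_seek)
```

This works, but it reads the whole file once per search, so a f.seek(0) call would be needed before every additional pattern.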

The first solution is actually recommended: you don't need to read the file more than once, and seeking features are mostly used for large file operations.

Here is a fixed version of your code following that recommendation:

import fileinput, os, glob, re

# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
content = open(filename, 'r').read()

# Create output csv write total found occurrences of search string after name of search string 
with open(filename[:-4] + 'output.csv','w') as output:    
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',content)))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',content)))) +"\n")

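Note that you no longer need to call close() yourself: the with block closes the output file, and the input file object is never bound to a variable, so CPython closes it once the read() call has returned. If you would rather not rely on that behavior, read the input file inside a with block as well.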
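If more search strings are added later, one possible way to avoid repeating the write line is to keep the patterns in a dictionary and loop over them, reading the file only once. This is only a sketch: the names PATTERNS, count_patterns and write_counts are hypothetical helpers, and the sample file in the demo is made up.

```python
import os
import re
import tempfile

# Hypothetical label -> regex table; the two regexes come from the question.
PATTERNS = {
    "String 1": r's5 .*w249 w1025 w301 w1026 .*',
    "String 1 reverse": r's5 .*w1026 w301 w1025 w249 .*',
}

def count_patterns(content, patterns):
    """Return {label: number of matches} for each regex in patterns."""
    return {label: len(re.findall(rx, content)) for label, rx in patterns.items()}

def write_counts(in_path, out_path, patterns):
    # Read the input once; the with block closes the file afterwards.
    with open(in_path, "r") as f:
        content = f.read()
    counts = count_patterns(content, patterns)
    # Write one "label,count" line per pattern.
    with open(out_path, "w") as output:
        output.write("------------Group 1----------\n")
        for label, n in counts.items():
            output.write(label + "," + str(n) + "\n")
    return counts

# Demo on a made-up sample file in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    in_path = os.path.join(tmpdir, "sample.txt")
    out_path = os.path.join(tmpdir, "sampleoutput.csv")
    with open(in_path, "w") as f:
        f.write("s5 a w249 w1025 w301 w1026 b\n"
                "s5 a w1026 w301 w1025 w249 b\n"
                "s5 c w1026 w301 w1025 w249 d\n")
    counts = write_counts(in_path, out_path, PATTERNS)
    with open(out_path) as f:
        csv_text = f.read()

print(counts)
```

Because every search runs against the same content string, no reopening or seeking is needed between patterns.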
