f.read() does not read between lines

Question

I use Python 3.6. I have some strings I want to check in a read.txt file. The problem is that the .txt file is written such that sentences may be cut and put into a different line. For example:

bla bla bla internal control over financial reporting or an attestation
report of our auditors

The .txt file cuts the sentence after the word "attestation" and starts with "report" in the following line. I want to look for the entire sentence in the file, irrespective of what line it is (and create var1=1 if the sentence is in the file, and 0 otherwise).

I use the following code to parse the files (it seems I don't know how to tell Python to ignore line breaks):

import re

str1 = 'internal control over financial reporting or an attestation report of our auditors'
exemptions = []
for eachfile in file_list:  # I have many .txt files in my directory
    with open(eachfile, 'r+', encoding='utf-8') as f:
        line2 = f.read()  # line2 should hold the whole .txt file
        var1 = re.findall(str1, line2, re.I)  # find str1 in line2
        if len(var1) > 0:
            exemptions.append('1')  # if something is found, append '1' (var1=1)
        else:
            exemptions.append('0')  # otherwise append '0' (var1=0)

Any idea of how to do that? I thought that by using the line2=f.read(), I was actually checking the whole .txt file, irrespective of lines, but it does not seem so....

Thank you anyways!

asongtoruin · Accepted Answer

You're assuming a newline is the same as a space - it's not. Try changing

line2 = f.read()

to

line2 = f.read().replace('\n', ' ').replace('\r', ' ')

This should replace any newlines in the file with spaces, thus allowing your search to work as intended.
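To see why the original search fails, here is a minimal sketch using an in-memory string in place of a file. The names `text`, `phrase`, and `flattened` are made up for illustration; the sample text is the example from the question.

```python
# Stand-in for what f.read() returns: the line break is kept as '\n'.
text = ("bla bla bla internal control over financial reporting or an attestation\n"
        "report of our auditors\n")
phrase = "attestation report of our auditors"

# The phrase has a space where the file has a newline, so this is False.
found_raw = phrase in text

# After replacing line breaks with spaces, the phrase is found.
flattened = text.replace('\n', ' ').replace('\r', ' ')
found_flat = phrase in flattened

print(found_raw, found_flat)
```

Note that if a file contained `\r\n` pairs that reached your string unchanged, the two replacements would leave two spaces in a row; in that case the regex approach may be the more robust option.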

You could similarly do

line2 = ' '.join(line.rstrip('\n') for line in f)

You could instead modify your regex:

var1 = re.findall(str1.replace(' ', r'\s+'), line2, re.I)  # find str1 in line2, allowing any whitespace between words
if var1:
    exemptions.append('1')
else:
    exemptions.append('0')

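In regex terms, \s matches any whitespace character (space, tab, newline, and so on), and \s+ matches one or more of them, so the words in the pattern can be separated by spaces or line breaks.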
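The regex approach could be wrapped up roughly as follows. This is only a sketch: `contains_phrase` is a hypothetical helper name, and the `re.escape` call is an addition of mine (not part of the answer above) so that characters such as `.` or `(` in the phrase are matched literally.

```python
import re

def contains_phrase(text, phrase):
    """Return True if phrase occurs in text, treating any run of
    whitespace (spaces, tabs, line breaks) between words as equivalent."""
    # Escape each word, then allow one or more whitespace chars between words.
    pattern = r'\s+'.join(re.escape(word) for word in phrase.split())
    return re.search(pattern, text, re.IGNORECASE) is not None

sample = ("bla bla bla internal control over financial reporting or an attestation\n"
          "report of our auditors")

found = contains_phrase(sample, "an attestation report of our auditors")
missing = contains_phrase(sample, "attestation of our auditors")
print(found, missing)
```

In the loop from the question, you would then append '1' if `contains_phrase(line2, str1)` is true and '0' otherwise.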
