Felix P

Reputation: 21

How can I read and search multiple text files and store a list of the files that match my search?

I hope you can help out a new learner of Python. I could not find my problem in other questions; if it has already been answered, apologies. What I basically want to do is this:

  1. Read a large number of text files and search each for a number of string terms.
  2. If the search terms are matched, store the corresponding file name to a new file called "filelist", so that I can tell the good files from the bad files.
  3. Export "filelist" to Excel or CSV.

Here is the code that I have so far:

import os
import re
import pandas as pd

# text files all contain only simple text, e.g. "6 Apples"
filelist = []
for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, encoding="Latin1") as f:
        fine = f.read()

        if re.search('APPLES', fine) or re.search('ORANGE', fine) or re.search('BANANA', fine):
            filelist.append(file)

listoffiles = pd.DataFrame(filelist)
# the context manager saves and closes the workbook on exit
with pd.ExcelWriter('ListofFiles.xlsx', engine='xlsxwriter') as writer:
    listoffiles.to_excel(writer, sheet_name='welcome', index=False)
print(filelist)

Questions:

  1. Surely there is a more elegant or time-efficient way? I need to do this for a large number of files :D
  2. Related to the former, is there a way to read in the files using pandas? Or is that less time-efficient? For me as a Stata user, having a dataframe feels a bit more like home....
  3. I added the "Latin1" option because some characters in the raw data cause encoding conflicts. Is there a way to find out which characters are causing the problem? Can I get rid of them easily, e.g. by cutting off the first line beforehand (skiprows, maybe)?

Upvotes: 2

Views: 48

Answers (1)

Andrej Kesely

Reputation: 195623

Just a couple of things to speed up the script:

1.) Compile your regex beforehand, not every time in the loop (also use | to combine multiple strings into one regex!).

2.) Read files line by line, not all at once!

3.) Use any() to stop searching a file as soon as you get the first match.

For example:

import re
import os

filelist=[]
r = re.compile(r'APPLES|ORANGE|BANANA') # you can add flags=re.I for case insensitive search

for file in os.listdir('C:/mydirectory/'):
    with open('C:/mydirectory/' + file, 'r', encoding='latin1') as f:
        if any(r.search(line) for line in f):   # read files line by line, not all content at once
            filelist.append(file)               # add to list

# convert list to pandas, etc...
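
The remaining "convert list, etc." step could look roughly like the sketch below, which wraps the loop above in a function and writes the matches to a CSV file with the standard library. The function names find_matching_files and write_filelist_csv, the 'filename' column header, and the output path are illustrative assumptions, not part of the original script. Skipping subdirectories is also an addition, since os.listdir() returns directories as well as files.

```python
import csv
import os
import re


def find_matching_files(directory, pattern):
    """Return names of files in `directory` with at least one line matching `pattern`."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # os.listdir() also returns subdirectories; skip them
        with open(path, 'r', encoding='latin1') as f:
            # any() stops reading the file at the first matching line
            if any(pattern.search(line) for line in f):
                matches.append(name)
    return matches


def write_filelist_csv(filelist, out_path):
    """Write one file name per row under a hypothetical 'filename' header."""
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['filename'])
        writer.writerows([name] for name in filelist)


if __name__ == '__main__':
    r = re.compile(r'APPLES|ORANGE|BANANA')
    filelist = find_matching_files('C:/mydirectory/', r)
    write_filelist_csv(filelist, 'ListofFiles.csv')  # assumed output name
```

If you would rather stay in pandas, pd.DataFrame({'filename': filelist}).to_csv('ListofFiles.csv', index=False) should give an equivalent file.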

Upvotes: 4
