jprockbelly
jprockbelly

Reputation: 1607

Python - efficient way to search file names based on multiple filters

I have a small bit of code to list file names that match a filter string. I'm trying to extend this to match against multiple filters. I have some working code which takes a very straight forward loop approach but it is sloooooooow.... basically running os.walk for every filter.

Given the function (shown below) is there a way to test against multiple filters at once, rather than 1 at a time? i.e. can I past a list of filter strings to find_files?

import os
import fnmatch

# stolen from http://stackoverflow.com/questions/8625991/use-python-os-walk-to-identify-a-list-of-files
def find_files(dir_look, filt):
    matches = []
    for root, dirnames, filenames in os.walk(dir_look):
      for filename in fnmatch.filter(filenames, filt):
          matches.append(os.path.join(root, filename))
    return matches

#create empty list to store results
filelist=[]

#some example filters, my real data has about 5000 filters
filts = [r'*60830007*',r'*60910259*',r'*60910299*']

#find files for each filter entry
for filter in filts:
    filelist.append(find_files(r'C:\some directory', filter))

EDIT:

I found a rather obvious way to speed things up by passing the list of filters to the function then testing each inside the os.walk

def find_files(dir_look, filters):
    matches = []
    for root, dirnames, filenames in os.walk(dir_look):
        for filt in filters:
            for filename in fnmatch.filter(filenames, filt):
                matches.append(os.path.join(root, filename))
    return matches

Upvotes: 2

Views: 1403

Answers (2)

nagordon
nagordon

Reputation: 1367

check out glob2. It is fast and robust.

Upvotes: -1

Ádám Nagy
Ádám Nagy

Reputation: 21

This answer will be about algorithms and data structures, instead of python programming.

  1. If you want to test lot of pattern against a string then you should choose a better structure for representation. Instead of char array we use suffix-trees. (For python implementations see this question.

  2. If some of your filters has common parts (especially if they have the same prefix) you should represent them as trie(s). So this way you can test simultaneously with more than one pattern. This solution creates an overhead of building the tree(s) but if you use the same filters multiple times then it's worth.

Upvotes: 2

Related Questions