Reputation: 8520
I have a directory tree with csv files, and I want to return files following this pattern (the pattern is from somewhere else, so I will need to stick to that):
"foo" should match foo/**/*.csv and/or foo.csv, so that "foo/bar" matches e.g. foo/bar.csv, foo/bar/baz.csv and foo/bar/baz/qux.csv.
So far, I have been iterating through the directory tree twice; first looking for files and then for directories:
from glob import iglob
from itertools import chain
import os

path = "csv_dir"
pattern = "foo/bar"
pattern = os.path.join(*pattern.split("/"))
path_with_pattern = os.path.join(path, pattern)

# first get all csv files in foo/bar and subdirs
files_1 = chain.from_iterable(iglob(os.path.join(root, '*.csv'))
                              for root, dirs, files in os.walk(path_with_pattern))
# then get all foo/bar.csv files
files_2 = chain.from_iterable(iglob(os.path.join(root, pattern + '.csv'))
                              for root, dirs, files in os.walk(path))
for f in chain(files_1, files_2):
    print(f)
This works, but it feels stupid to iterate the tree twice. Is there a clever file matching method I have missed? Or a simple way to filter them out if I start by getting all csv files in the tree?
Upvotes: 0
Views: 1695
Reputation: 26
If you can use a different module, I suggest regular expressions (the re module). I have found them useful for matching specific file and directory naming patterns while iterating through a directory.
Here is a little information on regular expressions if they are unfamiliar.
Python Documentation on regex: https://docs.python.org/2/library/re.html
Regex tool testing (works well, though it says it's for Ruby): http://rubular.com/
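import os
import re

def searchDirectory(cwd, searchParam, searchResults):
    # Visit every entry in cwd exactly once.
    for name in os.listdir(cwd):
        fullpath = os.path.join(cwd, name)
        # Recurse into subdirectories before testing the entry itself.
        if os.path.isdir(fullpath):
            searchDirectory(fullpath, searchParam, searchResults)
        # Files and directories alike are tested against the regex.
        if re.search(searchParam, fullpath):
            searchResults.append(fullpath)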
The function will iterate through a directory's contents and make a recursive call if and only if the current item is another directory. Afterwards, it will perform a regular expression search over the path of the current item. It will only access an item in a directory a single time.
I store the paths in a list for simplicity's sake, but you could do something else with each matching path by changing the body of the if statement that checks for a regular expression match:
if re.search(searchParam, fullpath):
    searchResults.append(fullpath)
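As one possible variation (a sketch, not part of the code I tested), the same single-pass search could be written as a generator on top of os.walk, so the caller decides what to do with each match. The name iter_matches is made up for illustration.

```python
import os
import re

def iter_matches(top, search_param):
    """Yield every file or directory path under `top` that matches the regex."""
    regex = re.compile(search_param)
    # os.walk visits each directory once, so every entry is tested once.
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            full_path = os.path.join(root, name)
            if regex.search(full_path):
                yield full_path
```

Looping over iter_matches(root, searchParam) then replaces the list that searchDirectory fills in.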
I ran the code below with a small test directory.
# Note: this pattern hard-codes Windows path separators (backslashes).
searchParam = r'(foo\\bar\\.*\.txt|foo\\.*bar\.txt)'
root = os.getcwd()
searchResults = []
searchDirectory(root, searchParam, searchResults)
print(searchResults)
My results after running:
<homePath>\foo\bar\baz.txt
<homePath>\foo\bar\biz\qua.txt
<homePath>\foo\bar.txt
<homePath>\foo\baz\bar.txt
As a note, I am using Python 2.7 with the Anaconda distribution.
Edit: I used text files to set up the test directory quickly, but the approach works the same if you change the extension in the regular expression (for example, to csv).
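Since the pattern I tested hard-codes backslashes and .txt, here is a hedged sketch of building the regular expression from the question's "foo/bar"-style pattern instead. build_regex and its ext parameter are hypothetical names. The sketch assumes the incoming pattern always uses "/" as its separator, and that a match anywhere in the tree (not only directly under the search root) is acceptable.

```python
import os
import re

def build_regex(pattern, ext=".csv"):
    """Compile a regex for '<pattern><ext>' or any '<ext>' file below '<pattern>/'."""
    sep = re.escape(os.sep)
    # Turn "foo/bar" into an escaped, platform-specific path fragment.
    prefix = sep.join(re.escape(part) for part in pattern.split("/"))
    suffix = re.escape(ext)
    # (?:^|sep) keeps "xfoo/bar.csv" from matching; $ anchors the extension.
    return re.compile(
        r"(?:^|{sep}){prefix}(?:{suffix}|{sep}.*{suffix})$".format(
            sep=sep, prefix=prefix, suffix=suffix
        )
    )
```

Because re.search also accepts a compiled pattern, the result of build_regex("foo/bar") can be passed as searchParam to searchDirectory unchanged.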
Upvotes: 1