Reputation: 4105
Is there a better way to use glob.glob in python to get a list of multiple file types such as .txt, .mdown, and .markdown? Right now I have something like this:
projectFiles1 = glob.glob( os.path.join(projectDir, '*.txt') )
projectFiles2 = glob.glob( os.path.join(projectDir, '*.mdown') )
projectFiles3 = glob.glob( os.path.join(projectDir, '*.markdown') )
Upvotes: 265
Views: 340557
Reputation: 1
In one line :
IMG_EXTS = (".jpg", ".jpeg", ".jpe", ".jfif", ".jfi", ".jif",".JPG")
directory = './'
files = [ file for file in glob.glob(directory+'/*') if file.endswith(IMG_EXTS)]
Upvotes: -2
Reputation: 1
glob.glob('Folder/*[.png,jpg,jpeg,pdf]')
Worked for me to search for images and pdf
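A caution on the pattern above: in glob syntax, square brackets define a character class, not a list of alternatives, so `*[.png,jpg,jpeg,pdf]` matches any name whose last character is one of the characters inside the brackets. A small check with fnmatch (which follows the same pattern rules as glob) illustrates this; the file names are made up for the example.

```python
from fnmatch import fnmatchcase

pattern = "*[.png,jpg,jpeg,pdf]"
names = ["photo.png", "scan.pdf", "notes.dg", "archive.tar.gz", "readme.txt"]

# The bracket expression only constrains the final character of the name,
# so "notes.dg" matches because it ends in "g".
matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)
```

So the pattern finds the images and PDFs, but it can also pick up unrelated files.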
Upvotes: -3
Reputation: 71
You can use this:
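import glob
import os
project_files = []
file_extensions = ['txt', 'mdown', 'markdown']
for file_extension in file_extensions:
    # os.path.join adds the separator that plain concatenation misses
    # when projectDir has no trailing slash
    project_files.extend(glob.glob(os.path.join(projectDir, '*.' + file_extension)))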
Upvotes: 0
Reputation: 1
Maybe I'm missing something but if it's just plain glob maybe you could do something like this?
projectFiles = glob.glob(os.path.join(projectDir, '*.{txt,mdown,markdown}'))
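Note that the standard-library glob does not expand shell-style braces, so the pattern above is taken literally and will normally return an empty list. One possible workaround is to expand the braces yourself before globbing. In the sketch below, `expand_braces` is a hypothetical helper (not part of glob) that handles a single, non-nested `{a,b,c}` group, and the temporary directory stands in for projectDir.

```python
import glob
import os
import re
import tempfile

def expand_braces(pattern):
    """Expand one non-nested {a,b,c} group into a list of plain patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]

# Demonstrate on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as project_dir:
    for name in ("a.txt", "b.mdown", "c.markdown", "d.py"):
        open(os.path.join(project_dir, name), "w").close()
    patterns = expand_braces(os.path.join(project_dir, "*.{txt,mdown,markdown}"))
    found = sorted(
        os.path.basename(path)
        for pattern in patterns
        for path in glob.glob(pattern)
    )
print(found)
```

This still globs once per extension; it only keeps the brace notation at the call site.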
Upvotes: -2
Reputation: 988
Easiest way is using itertools.chain
from pathlib import Path
import itertools
cwd = Path.cwd()
for file in itertools.chain(
    cwd.rglob("*.txt"),
    cwd.rglob("*.md"),
):
    print(file.name)
Upvotes: 0
Reputation: 9863
Many answers suggest globbing once per extension. I'd prefer globbing just once and filtering by suffix instead:
from pathlib import Path
files = (p.resolve() for p in Path(path).glob("**/*") if p.suffix in {".c", ".cc", ".cpp", ".hxx", ".h"})
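The same single-pass idea can be wrapped in a small function. This is only a sketch: `find_by_suffix` is a made-up name, and the lowercasing and the `is_file()` check are additions that the one-liner above does not perform.

```python
import tempfile
from pathlib import Path

def find_by_suffix(root, suffixes):
    """Yield files under root whose suffix, lowercased, is in suffixes."""
    wanted = {s.lower() for s in suffixes}
    for p in Path(root).rglob("*"):
        # Skip directories and compare suffixes case-insensitively.
        if p.is_file() and p.suffix.lower() in wanted:
            yield p

# Throwaway tree used only to exercise the function.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "src").mkdir()
    for rel in ("main.c", "src/util.H", "src/notes.txt"):
        (Path(root) / rel).touch()
    names = sorted(p.name for p in find_by_suffix(root, {".c", ".h"}))
print(names)
```

Because the tree is walked once, the cost does not grow with the number of extensions.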
Upvotes: 102
Reputation: 1360
import os
import glob
projectFiles = [i for i in glob.glob(os.path.join(projectDir,"*")) if os.path.splitext(i)[-1].lower() in ['.txt','.markdown','.mdown']]
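os.path.splitext returns a (filename, extension) pair, where the extension includes the leading dot:
filename, extension = os.path.splitext('filename.extension')  # ('filename', '.extension')
.lower() converts the extension to lowercase, so the comparison is case-insensitive.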
Upvotes: -4
Reputation: 6386
Same answer as @BPL (which is computationally efficient) but which can handle any glob pattern rather than extension:
import os
from fnmatch import fnmatch
folder = "path/to/folder/"
patterns = ("*.txt", "*.md", "*.markdown")
files = [f.path for f in os.scandir(folder) if any(fnmatch(f.name, p) for p in patterns)]
This solution is both efficient and convenient. It also closely matches the behavior of glob (see the documentation).
Note that this is simpler with the built-in pathlib package:
from pathlib import Path
folder = Path("/path/to/folder")
patterns = ("*.txt", "*.md", "*.markdown")
files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]
Upvotes: 15
Reputation: 29
This worked for me. filename.split('.')[-1] extracts the filename suffix (the xxx in *.xxx), which you can compare against the extensions you want:
wanted = ('tif', 'tiff', 'bmp', 'jpg', 'jpeg', 'png')
for filename in glob.glob(folder + '*.*'):
    # glob already returns the path including folder
    print(filename)
    if filename.split('.')[-1] not in wanted:
        continue
    # Your code
Upvotes: 1
Reputation: 1458
For example, for *.mp3 and *.flac in multiple folders, you can do:
mask = r'music/*/*.[mf][pl][3a]*'
glob.glob(mask)
The idea can be extended to more file extensions, but you have to check that the combinations won't match any other unwanted file extension you may have on those folders. So, be careful with this.
To automatically combine an arbitrary list of extensions into a single glob pattern, you can do the following:
def multi_extension_glob_mask(mask_base, *extensions):
    mask_ext = ['[{}]'.format(''.join(set(c))) for c in zip(*extensions)]
    if not mask_ext or len(set(len(e) for e in extensions)) > 1:
        mask_ext.append('*')
    return mask_base + ''.join(mask_ext)
mask = multi_extension_glob_mask('music/*/*.', 'mp3', 'flac', 'wma')
print(mask) # music/*/*.[mfw][pml][a3]*
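To make the warning about unwanted matches concrete, here is a small check of the extension part of that mask against some made-up file names, using fnmatch (which follows the same pattern rules as glob).

```python
from fnmatch import fnmatchcase

mask = "*.[mfw][pml][a3]*"  # extension part of the mask printed above
names = ["song.mp3", "song.flac", "song.wma",
         "song.mpa", "notes.fla", "clip.wl3x", "cover.jpg"]

matched = [n for n in names if fnmatchcase(n, mask)]
# Anything matched that is not one of the three intended extensions.
unwanted = [n for n in matched if not n.endswith((".mp3", ".flac", ".wma"))]
print(unwanted)
```

Each position is matched independently, so any combination of the listed characters is accepted, and the trailing `*` lets longer extensions through as well.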
Upvotes: 39
Reputation: 11130
We can use pathlib; .glob still doesn't support globbing multiple arguments or within braces (as in POSIX shells), but we can easily filter the result.
For example, where you might ideally like to do:
# NOT VALID
Path(config_dir).glob("*.{ini,toml}")
# NOR IS
Path(config_dir).glob("*.ini", "*.toml")
you can do:
filter(lambda p: p.suffix in {".ini", ".toml"}, Path(config_dir).glob("*"))
which isn't too much worse.
Upvotes: 9
Reputation: 715
From previous answer
glob('*.jpg') + glob('*.png')
Here is a shorter one,
from glob import glob
extensions = ['jpg', 'png'] # to find these filename extensions
# Method 1: loop over the extensions and extend the output list
output = []
for name in extensions:
    output.extend(glob(f'*.{name}'))
print(output)
# Method 2: even shorter
# loop filename extension to glob() it and flatten it to a list
output = [p for p2 in [glob(f'*.{name}') for name in extensions] for p in p2]
print(output)
Upvotes: 2
Reputation: 1486
glob returns a list: why not just run it multiple times and concatenate the results?
from glob import glob
project_files = glob('*.txt') + glob('*.mdown') + glob('*.markdown')
Upvotes: 135
Reputation: 1078
Use a list of extensions and iterate through it:
from os.path import join
from glob import glob
files = []
extensions = ['*.gif', '*.png', '*.jpg']
for ext in extensions:
    files.extend(glob(join("path/to/dir", ext)))
print(files)
Upvotes: 1
Reputation: 131587
Maybe there is a better way, but how about:
import glob
types = ('*.pdf', '*.cpp') # the tuple of file types
files_grabbed = []
for files in types:
    files_grabbed.extend(glob.glob(files))
# files_grabbed is the list of pdf and cpp files
Perhaps there is another way, so wait in case someone else comes up with a better answer.
Upvotes: 240
Reputation: 1734
While Python's default glob doesn't really follow Bash's glob, you can do this with other libraries. We can enable braces in wcmatch's glob.
>>> from wcmatch import glob
>>> glob.glob('*.{md,ini}', flags=glob.BRACE)
['LICENSE.md', 'README.md', 'tox.ini']
You can even use extended glob patterns if that is your preference:
>>> from wcmatch import glob
>>> glob.glob('*.@(md|ini)', flags=glob.EXTGLOB)
['LICENSE.md', 'README.md', 'tox.ini']
Upvotes: 23
Reputation: 391
import glob
import pandas as pd
# Raw strings keep the backslashes in the Windows path literals.
patterns = (r'C:\dir\path\*.txt', r'C:\dir\path\*.mdown', r'C:\dir\path\*.markdown')
paths = []
for pattern in patterns:
    paths.extend(glob.glob(pattern))
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df1 = pd.DataFrame({'A': paths})
Upvotes: -4
Reputation: 2005
Based on empirical tests, glob.glob turned out not to be the best way to filter files by extension, for reasons of both speed and correctness (see the results below).
I tested (for correctness and running time) the following 4 methods, each of which filters files by extension and puts them in a list:
from glob import glob, iglob
from re import compile, findall
from os import walk
from os.path import join as path_join  # path_join is used below but was never imported
def glob_with_storage(args):
    # Each [ext] is a character class, not a literal extension
    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = glob(globs, recursive=True)
    return results
def glob_with_iteration(args):
    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = [i for i in iglob(globs, recursive=True)]
    return results
def walk_with_suffixes(args):
    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            for e in args.extensions:
                if ff.endswith(e):
                    results.append(path_join(r, ff))
                    break
    return results
def walk_with_regs(args):
    reg = compile('|'.join([f'{i}$' for i in args.extensions]))
    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            if len(findall(reg, ff)):
                results.append(path_join(r, ff))
    return results
By running the code above on my laptop I obtained the following self-explanatory results.
Elapsed time for '7 times glob_with_storage()': 0.365023 seconds.
mean : 0.05214614
median : 0.051861
stdev : 0.001492152
min : 0.050864
max : 0.054853
Elapsed time for '7 times glob_with_iteration()': 0.360037 seconds.
mean : 0.05143386
median : 0.050864
stdev : 0.0007847381
min : 0.050864
max : 0.052859
Elapsed time for '7 times walk_with_suffixes()': 0.26529 seconds.
mean : 0.03789857
median : 0.037899
stdev : 0.0005759071
min : 0.036901
max : 0.038896
Elapsed time for '7 times walk_with_regs()': 0.290223 seconds.
mean : 0.04146043
median : 0.040891
stdev : 0.0007846776
min : 0.04089
max : 0.042885
Results sizes:
0 2451
1 2451
2 2446
3 2446
Differences between glob() and walk():
0 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\numpy
1 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Utility\CppSupport.cpp
2 E:\x\y\z\venv\lib\python3.7\site-packages\future\moves\xmlrpc
3 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\libcpp
4 E:\x\y\z\venv\lib\python3.7\site-packages\future\backports\xmlrpc
Elapsed time for 'main': 1.317424 seconds.
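The fastest way to filter files by extension also happens to be the ugliest one: nested for loops and string comparison using the endswith() method.
Moreover, as you can see, the glob-based methods (with the pattern E:\x\y\z\**/*[py][pyc]) return incorrect results even with only 2 extensions given (py and pyc). This is because [py][pyc] is a pair of character classes, so the pattern matches any name ending in p or y followed by p, y, or c (for example numpy, xmlrpc, or CppSupport.cpp), directories included.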
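As a possible tidier variant of walk_with_suffixes: str.endswith accepts a tuple of suffixes, so the innermost loop can be dropped. The function name, the dotted extensions, and the throwaway directory layout below are assumptions for illustration, and no timing claim is made for it.

```python
import os
import tempfile

def walk_with_suffix_tuple(target, extensions):
    """Collect files under target whose name ends with one of extensions."""
    suffixes = tuple(extensions)  # str.endswith accepts a tuple of suffixes
    results = []
    for root, _dirs, files in os.walk(target):
        for name in files:
            if name.endswith(suffixes):
                results.append(os.path.join(root, name))
    return results

# Throwaway tree: a directory named "numpy" and a .cpp file must not match.
with tempfile.TemporaryDirectory() as target:
    os.makedirs(os.path.join(target, "pkg", "numpy"))
    for name in ("mod.py", "mod.pyc", "CppSupport.cpp"):
        open(os.path.join(target, "pkg", name), "w").close()
    found = sorted(
        os.path.basename(p)
        for p in walk_with_suffix_tuple(target, (".py", ".pyc"))
    )
print(found)
```

Passing the extensions with their leading dot avoids matching names that merely end in the same letters.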
Upvotes: 8
Reputation: 780
If you use pathlib, try this:
import pathlib
extensions = ['.py', '.txt']
root_dir = './test/'
files = filter(lambda p: p.suffix in extensions, pathlib.Path(root_dir).glob('**/*'))
print(list(files))
Upvotes: 0
Reputation: 1648
Yet another solution (use glob to get paths using multiple match patterns and combine all paths into a single list using reduce and add):
import functools, glob, operator
paths = functools.reduce(operator.add, [glob.glob(pattern) for pattern in [
"path1/*.ext1",
"path2/*.ext2"]])
Upvotes: 0
Reputation: 6602
For example:
import glob
lst_img = []
base_dir = '/home/xy/img/'
# get all the jpg files in base_dir
lst_img += glob.glob(base_dir + '*.jpg')
print(lst_img)
# ['/home/xy/img/2.jpg', '/home/xy/img/1.jpg']
# append all the png files in base_dir to lst_img
lst_img += glob.glob(base_dir + '*.png')
print(lst_img)
# ['/home/xy/img/2.jpg', '/home/xy/img/1.jpg', '/home/xy/img/3.png']
A function:
import glob
def get_files(base_dir='/home/xy/img/', lst_extension=('*.jpg', '*.png')):
    """
    :param base_dir: base directory
    :param lst_extension: patterns such as ['*.jpg', '*.png', ...]
    :return: list of files such as ['/home/xy/img/2.jpg', '/home/xy/img/3.png']
    """
    lst_files = []
    for ext in lst_extension:
        lst_files += glob.glob(base_dir + ext)
    return lst_files
Upvotes: -2
Reputation: 99
Here is a one-line list-comprehension variant of Pat's answer (which also takes into account that you wanted to glob in a specific project directory):
import os, glob
exts = ['*.txt', '*.mdown', '*.markdown']
files = [f for ext in exts for f in glob.glob(os.path.join(project_dir, ext))]
You loop over the extensions (for ext in exts), and then for each extension you take each file matching the glob pattern (for f in glob.glob(os.path.join(project_dir, ext))).
This solution is short, and without any unnecessary for-loops, nested list-comprehensions, or functions to clutter the code. Just pure, expressive, pythonic Zen.
This solution allows you to have a custom list of exts that can be changed without having to update your code. (This is always a good practice!)
The list-comprehension is the same used in Laurent's solution (which I've voted for). But I would argue that it is usually unnecessary to factor out a single line to a separate function, which is why I'm providing this as an alternative solution.
Bonus:
If you need to search not just a single directory, but also all sub-directories, you can pass recursive=True and use the multi-directory glob symbol ** (see note 1):
files = [f for ext in exts
         for f in glob.glob(os.path.join(project_dir, '**', ext), recursive=True)]
This will invoke glob.glob('<project_dir>/**/*.txt', recursive=True) and so on for each extension.
Note 1: In Python's glob, ** is only special when recursive=True is passed and it makes up a whole path component. In that case it matches zero or more directories. Anywhere else it behaves like the singular * glob symbol, which does not cross path separators. In practice, you just need to remember to surround ** with forward slashes (path separators).
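A small check of how `**` behaves in the standard-library glob, run against a throwaway directory tree. The file names and the `names` helper are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as project_dir:
    os.makedirs(os.path.join(project_dir, "docs", "deep"))
    for rel in ("top.txt", "docs/mid.txt", "docs/deep/low.txt"):
        open(os.path.join(project_dir, *rel.split("/")), "w").close()

    def names(pattern, recursive):
        """Glob under project_dir and return sorted, slash-separated relative paths."""
        full = os.path.join(project_dir, *pattern.split("/"))
        hits = glob.glob(full, recursive=recursive)
        return sorted(os.path.relpath(h, project_dir).replace(os.sep, "/") for h in hits)

    recursive_hits = names("**/*.txt", True)   # ** as a whole component, recursive
    flat_hits = names("**/*.txt", False)       # without recursive=True, ** acts like *
    fused_hits = names("**.txt", True)         # ** fused with other text acts like *

print(recursive_hits)
print(flat_hits)
print(fused_hits)
```

Only the first call descends through every level, which is why both recursive=True and the surrounding separators matter.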
Upvotes: 9
Reputation: 143
I had the same issue and this is what I came up with
import os, re
# without glob
src_dir = '/mnt/mypics/'
src_pics = []
# Match names ending in .png, .jpeg, or .jpg
ext = re.compile(r'.*\.({})$'.format('|'.join(['png', 'jpeg', 'jpg'])))
for root, dirnames, filenames in os.walk(src_dir):
    for filename in filter(lambda name: ext.search(name), filenames):
        src_pics.append(os.path.join(root, filename))
Upvotes: 1
Reputation: 907
this worked for me:
import glob
images = glob.glob('*.JPG' or '*.jpg' or '*.png')
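A caution on the expression above: Python evaluates `'*.JPG' or '*.jpg' or '*.png'` before glob ever sees it, and `or` returns its first truthy operand, so only `'*.JPG'` is passed. It may appear to work on platforms where glob matching is case-insensitive (Windows, for instance), but the `.png` pattern is never used. The sketch below uses made-up file names and fnmatch's case-sensitive matcher to show the difference.

```python
from fnmatch import fnmatchcase

pattern = '*.JPG' or '*.jpg' or '*.png'
print(pattern)  # the first non-empty string wins

names = ["a.JPG", "b.jpg", "c.png"]
patterns = ("*.JPG", "*.jpg", "*.png")

# Matching against the single surviving pattern versus all three patterns.
with_or = [n for n in names if fnmatchcase(n, pattern)]
with_any = [n for n in names if any(fnmatchcase(n, p) for p in patterns)]
print(with_or)
print(with_any)
```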
Upvotes: -8
Reputation: 9779
One glob, many extensions... but imperfect solution (might match other files).
filetypes = ['tif', 'jpg']
filetypes = zip(*[list(ft) for ft in filetypes])
filetypes = ["".join(ch) for ch in filetypes]
filetypes = ["[%s]" % ch for ch in filetypes]
filetypes = "".join(filetypes) + "*"
print(filetypes)
# => [tj][ip][fg]*
glob.glob("/path/to/*.%s" % filetypes)
Upvotes: 1
Reputation: 22942
To glob multiple file types, you need to call the glob() function several times in a loop. Since this function returns a list, you need to concatenate the lists.
For instance, this function does the job:
import glob
import os
def glob_filetypes(root_dir, *patterns):
    return [path
            for pattern in patterns
            for path in glob.glob(os.path.join(root_dir, pattern))]
Simple usage:
project_dir = "path/to/project/dir"
for path in sorted(glob_filetypes(project_dir, '*.txt', '*.mdown', '*.markdown')):
    print(path)
You can also use glob.iglob() to get an iterator. From the documentation:
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
def iglob_filetypes(root_dir, *patterns):
    return (path
            for pattern in patterns
            for path in glob.iglob(os.path.join(root_dir, pattern)))
Upvotes: 1
Reputation: 691
A one-liner, Just for the hell of it..
folder = "C:\\multi_pattern_glob_one_liner"
files = [item for sublist in [glob.glob(folder + ext) for ext in ["/*.txt", "/*.bat"]] for item in sublist]
output:
['C:\\multi_pattern_glob_one_liner\\dummy_txt.txt', 'C:\\multi_pattern_glob_one_liner\\dummy_bat.bat']
Upvotes: 6
Reputation: 574
After coming here for help, I made my own solution and wanted to share it. It's based on user2363986's answer, but I think this is more scalable: even if you have 1000 extensions, the code will still look somewhat elegant.
from glob import glob
directoryPath = "C:\\temp\\*."
fileExtensions = [ "jpg", "jpeg", "png", "bmp", "gif" ]
listOfFiles = []
for extension in fileExtensions:
    listOfFiles.extend( glob( directoryPath + extension ))
for file in listOfFiles:
    print(file) # Or do other stuff
Upvotes: 4