steveJ

Reputation: 2441

Fastest way to search files in a directory - Python

I have multiple directories, each containing thousands of files (10k+). Let's pick one directory, A, with 10k files, and another directory, B, that also has thousands of files. I'm trying to find all files that appear in both A and B and also have a particular file extension (let's say .docx). I can apply a nested for loop easily, but since the files number in the many thousands, it takes a lot of time. Is there a faster way to do this in Python? Any specific algorithm or code snippet you'd suggest?

Note - I know how to search for and retrieve files in multiple ways; I'm asking for the fastest approach. The files number in the millions, and iterating through each of them again and again will be costly.
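For reference, the nested-loop baseline mentioned above looks roughly like this (a minimal sketch, assuming both directories are flat and reachable as 'A' and 'B'):

import os

# O(n * m): every .docx name in A is compared against every name in B
matches = []
for a in os.listdir('A'):
    if a.endswith('.docx'):
        for b in os.listdir('B'):
            if a == b:
                matches.append(a)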

Upvotes: 0

Views: 2949

Answers (3)

glibdud

Reputation: 7840

The canonical method for comparing directories in Python appears to be filecmp.dircmp().

import filecmp

# common_files lists the names that appear in both directories
cmp = filecmp.dircmp('/path/to/A', '/path/to/B')
matchingfiles = [filename for filename in cmp.common_files if filename.endswith('.docx')]

I can't speak specifically to its performance, but I would assume it's implemented in a way that will be more efficient than nested for loops.
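If only the names need to match, a plain set intersection over the two listings avoids the quadratic comparison entirely; a minimal sketch, assuming the same paths as above:

import os

# Build a set from one listing, then intersect: roughly O(n + m) instead of O(n * m)
docx_in_a = {name for name in os.listdir('/path/to/A') if name.endswith('.docx')}
matching = docx_in_a & set(os.listdir('/path/to/B'))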

Upvotes: 1

iBug

Reputation: 37227

Try the glob module:

import glob
glob.glob('/*')

Output (Ubuntu 18.04):

['/bin', '/boot', '/cache', '/data', '/dev', '/etc', '/home', '/init', '/lib', '/lib64', '/media', '/mnt', '/opt', '/proc', '/root', '/run', '/sbin', '/snap', '/srv', '/sys', '/tmp', '/usr', '/var']

Of course, you can glob something else:

glob.glob("*.docx")

Upvotes: 0

kevh

Reputation: 323

You can do something like this:

import os

# Keep only the .docx entries from A's directory listing
[x for x in os.listdir('A') if x.endswith('.docx')]

This will select the '.docx' files in the 'A' folder.
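At the scale mentioned in the question, os.scandir() (Python 3.5+) may be worth trying instead, since it yields entries lazily and on most platforms exposes the file type without an extra stat() call; a minimal sketch:

import os

# Lazily iterate the directory; entry.is_file() usually needs no extra stat()
docx_files = [entry.name for entry in os.scandir('A')
              if entry.name.endswith('.docx') and entry.is_file()]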

Upvotes: 0
