ssc

Reputation: 11

How to loop through a large number of folders faster

I have to loop through a large number of folders (around 60 000) at a given path (My_Path). Then, in each folder under My_Path, I have to check whether a file name contains a certain date. I want to know if there is a faster way than looping through them one by one with the os library.

My_Path:
- Folder 1
    o File 1
    o File 2
    o File 3
    o …
- Folder 2
- Folder 3
- …
- Folder 60 000

import os

My_Path = r'\\...\...\...\...'
mylist2 = os.listdir(My_Path)  # list of the 60 000 folder names
for folder in mylist2:
    # list of all files in this folder
    mylist = os.listdir(os.path.join(My_Path, folder))
    for file in mylist:
        Check_Function(file)

The current run takes around one hour, and I would like to know whether there is a faster approach.

Thanks !!

Upvotes: 1

Views: 1634

Answers (2)

rpanai

Reputation: 13447

As others have already suggested, you can obtain the list lst of all files as in this answer. Then you can run your function over it with multiprocessing.
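
For example, a minimal sketch of building lst with os.walk() (assuming Check_Function accepts a full file path):

import os

My_Path = r'\\...\...\...\...'

# Collect the full path of every file under My_Path.
lst = [os.path.join(path, file)
       for path, dirs, files in os.walk(My_Path)
       for file in files]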

import multiprocessing as mp


def parallelize(fun, vec, cores):
    # Map fun over vec using a pool of `cores` worker processes.
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

and run

res = parallelize(Check_Function, lst, mp.cpu_count()-1)

Update: given that Check_Function is probably not CPU-bound, you can use even more workers.
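
If Check_Function is indeed I/O-bound (mostly waiting on the file system rather than computing), a thread pool gives the same map interface without the process start-up cost; a minimal sketch, assuming lst as above (note that on Windows the process-based version must be run under an if __name__ == '__main__': guard):

from multiprocessing.pool import ThreadPool

# Threads suit I/O-bound work: more workers than cores is fine
# because each worker mostly waits on the file system.
with ThreadPool(32) as p:
    res = p.map(Check_Function, lst)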

Upvotes: 2

ilmiacs

Reputation: 2576

Try os.walk(); it might be faster:

import os
My_Path = r'\\...\...\...\...'
for path, dirs, files in os.walk(My_Path): 
    for file in files:
        Check_Function(os.path.join(path, file))

If it is not, it may be your Check_Function that is eating up the cycles.
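
If directory listing itself is the bottleneck (common on network shares), you could also try os.scandir() directly; os.walk() has been built on it since Python 3.5, but a flat two-level scan avoids descending into anything you don't need. A sketch, assuming the tree is exactly two levels deep as in the question:

import os

My_Path = r'\\...\...\...\...'

with os.scandir(My_Path) as folders:
    for folder in folders:
        if folder.is_dir():
            with os.scandir(folder.path) as files:
                for entry in files:
                    if entry.is_file():
                        Check_Function(entry.path)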

Upvotes: 2
