Andreas Alamanos

Reputation: 348

Speed up file matching based on names of files

So I have 2 directories with 2 different file types (eg .csv, .png) but with the same basename (eg 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.

What I want to do is get the full paths of files after matching the basenames, and then DO something with the full path of both files.

I am asking for some help on how to speed up the procedure.

My approach is:

csvList = [a list with the full path of each .csv file]
pngList = [a list with the full path of each .png file]


for i in range(len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # eg 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for j in range(len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            # DO SOMETHING

Moreover, I tried fnmatch, something like:

for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    pattern = "*" + csv_id + "*.png"

    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        # DO something

It seems that using the nested for loops is faster. But I want it to be even faster. Are there any other approaches I could use to speed up my code?

Upvotes: 2

Views: 579

Answers (2)

MisterMiyagi

Reputation: 50116

You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:

from os.path import basename, splitext

png_lookup = {
    splitext(basename(png_path))[0] : png_path
    for png_path in pngList
}

This allows you to directly look up the png file corresponding to each csv file:

for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass
    else:
        pass  # do something with csv_file and png_file

In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).
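Putting the two pieces together, a self-contained sketch of the whole approach (the paths below are made up for illustration):

```python
from os.path import basename, splitext

# Hypothetical example paths standing in for csvList / pngList
csvList = ["/data/csv/1001_12_15.csv", "/data/csv/1002_01_07.csv"]
pngList = ["/img/1001_12_15.png", "/img/1003_05_02.png"]

# Build the index once: O(n)
png_lookup = {splitext(basename(p))[0]: p for p in pngList}

# Probe it once per csv file: O(1) each
pairs = []
for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    png_file = png_lookup.get(csv_id)  # None if no match
    if png_file is not None:
        pairs.append((csv_file, png_file))

print(pairs)  # [('/data/csv/1001_12_15.csv', '/img/1001_12_15.png')]
```

Here `dict.get` replaces the `try`/`except KeyError` pattern; both are O(1) per lookup, so pick whichever reads better to you.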

Upvotes: 0

Piakkaa

Reputation: 818

First of all, tidy up the syntax of your existing loop like this:

for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here

Nested loops are very slow because the inner png loop runs once for every csv file, i.e. n² iterations in total.

Then you can use list comprehensions and list indexing to speed it up more:

## create lists of processed values
## so you don't have to keep calling the os library
csv_base_list = [os.path.basename(csv) for csv in csvlist]
csv_id_list = [os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list = [os.path.basename(png) for png in pnglist]
png_id_list = [os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]


## run a single loop with list.index to find each matching pair
## and record the base values

csv_png_base = [(csv_base_list[csv_id_list.index(png_id)], png_base)
                for png_id, png_base in zip(png_id_list, png_base_list)
                if png_id in csv_id_list]

## csv_png_base contains tuples of (csv_base, png_base)

- this logic using list index reduces the loop count significantly and there are no repetitive os lib calls
- a list comprehension is slightly faster than a normal loop

You can loop through the list and do something with the values eg

for csv_base,png_base in csv_png_base:
    #do something

pandas will do the job much, much faster though, because it runs the loop in a C library.
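For example, a sketch of how pandas could pair the files with an inner merge on the basename (the paths and column names here are made up for illustration):

```python
import pandas as pd
from os.path import basename, splitext

# Hypothetical example paths standing in for csvlist / pnglist
csvList = ["/data/csv/1001_12_15.csv", "/data/csv/1002_01_07.csv"]
pngList = ["/img/1001_12_15.png", "/img/1003_05_02.png"]

csv_df = pd.DataFrame({"path_csv": csvList})
csv_df["key"] = csv_df["path_csv"].map(lambda p: splitext(basename(p))[0])

png_df = pd.DataFrame({"path_png": pngList})
png_df["key"] = png_df["path_png"].map(lambda p: splitext(basename(p))[0])

# Inner merge keeps only basenames present in both directories
matched = csv_df.merge(png_df, on="key")
print(matched[["path_csv", "path_png"]])
```

The merge itself runs in compiled code, which is where the speedup over a pure-Python nested loop comes from.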

Upvotes: 1
