Andreas Alamanos

Reputation: 348

Speed up file matching based on names of files

So I have 2 directories with 2 different file types (eg .csv, .png) but with the same basename (eg 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.

What I want to do is get the full paths of files after matching the basenames, and then DO something with the full path of both files.

I am asking for some help on how to speed up the procedure.

My approach is:

csvList = [a list with the full path of each .csv file]
pngList = [a list with the full path of each .png file]


for i in range(len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # eg 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for j in range(len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            # DO SOMETHING

Moreover, I tried fnmatch, something like:

for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    pattern = "*" + csv_id + "*.png"

    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        # DO something

It seems that using the nested for loops is faster. But I want it to be even faster. Are there any other approaches I could use to speed up my code?

Upvotes: 2

Views: 579

Answers (2)

MisterMiyagi

Reputation: 50116

You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:

from os.path import basename, splitext

png_lookup = {
    splitext(basename(png_path))[0] : png_path
    for png_path in pngList
}

This allows you to directly look up the png file corresponding to each csv file:

for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass
    else:
        pass  # do something with csv_file and png_file

In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).
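Putting the two pieces together, a self-contained sketch of the whole approach (the paths below are made up for illustration):

```python
from os.path import basename, splitext

# Hypothetical example paths standing in for csvList / pngList
csvList = ["/data/csv/1001_12_15.csv", "/data/csv/1002_01_07.csv"]
pngList = ["/img/1001_12_15.png", "/img/1003_05_02.png"]

# Build the index once: O(n)
png_lookup = {splitext(basename(p))[0]: p for p in pngList}

# Probe it once per csv file: O(1) each
pairs = []
for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    png_file = png_lookup.get(csv_id)  # None if no match
    if png_file is not None:
        pairs.append((csv_file, png_file))

print(pairs)  # [('/data/csv/1001_12_15.csv', '/img/1001_12_15.png')]
```

Here `dict.get` replaces the `try`/`except KeyError` pattern; both are O(1) per lookup, so pick whichever reads better to you.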

Upvotes: 0

Piakkaa

Reputation: 818

First of all, tidy up the syntax of your existing loop like this:

for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here

Nested loops are very slow because the inner png loop runs once for every csv file, i.e. n² iterations in total.

Then you can use list comprehensions and list indexing to speed it up more:

## create lists of processed values
## so you don't have to keep calling the os library
csv_base_list = [os.path.basename(csv) for csv in csvlist]
csv_id_list = [os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list = [os.path.basename(png) for png in pnglist]
png_id_list = [os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]


## run a single loop with list.index to find each matching pair
## and record the base values

csv_png_base = [(csv_base_list[csv_id_list.index(png_id)], png_base)
                for png_id, png_base in zip(png_id_list, png_base_list)
                if png_id in csv_id_list]

## csv_png_base contains tuples of (csv_base, png_base)

- this logic using list index reduces the loop count significantly and there are no repetitive os lib calls
- a list comprehension is slightly faster than a normal loop

You can loop through the list and do something with the values eg

for csv_base,png_base in csv_png_base:
    #do something

pandas will do the job much, much faster though, because it runs the loop in a C library.
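For example, a sketch of how pandas could pair the files with an inner merge on the basename (the paths and column names here are made up for illustration):

```python
import pandas as pd
from os.path import basename, splitext

# Hypothetical example paths standing in for csvlist / pnglist
csvList = ["/data/csv/1001_12_15.csv", "/data/csv/1002_01_07.csv"]
pngList = ["/img/1001_12_15.png", "/img/1003_05_02.png"]

csv_df = pd.DataFrame({"path_csv": csvList})
csv_df["key"] = csv_df["path_csv"].map(lambda p: splitext(basename(p))[0])

png_df = pd.DataFrame({"path_png": pngList})
png_df["key"] = png_df["path_png"].map(lambda p: splitext(basename(p))[0])

# Inner merge keeps only basenames present in both directories
matched = csv_df.merge(png_df, on="key")
print(matched[["path_csv", "path_png"]])
```

The merge itself runs in compiled code, which is where the speedup over a pure-Python nested loop comes from.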

Upvotes: 1
