Reputation: 23
I have a large repository of image files (~2 million .jpg) with individual ids, spread across multiple sub-dirs, and I'm trying to locate and copy each image whose id appears on a list containing a subset of ~1,000 of these ids.
I'm still very new to Python, so my first thought was to use os.walk to iterate through the 1k subset for each file, to see if any id within the subset matched the file. This works, at least in theory, but it seems incredibly slow, at something like 3-5 images a second. Running through all of the files looking for one id at a time seems just as slow.
import csv
import os
import shutil

# Walk the folder tree, identifying files
for root, dirs, files in os.walk(ImgFolder):
    for file in files:
        fileName = os.path.join(root, file)
        # For each file, check dictionary for match
        with open(DictFolder, 'r') as data1:
            for row in csv.DictReader(data1):
                img_id_line = row['id_line']
                isIdentified = (img_id_line in fileName) and ('.jpg' in fileName)
                # If id_line == file ID, copy file
                if isIdentified:
                    dst = os.path.join(dstFolder, file)
                    shutil.copyfile(fileName, dst)
I've been looking at trying to automate query searches instead, but the data is stored on a NAS and I have no easy way of indexing the files to make querying faster. The machine I'm running the code on is a Windows 10 box, so I can't use the Unix find command, which I gather is considerably better at this task.
Any way to speed up the process would be greatly appreciated!
Upvotes: 2
Views: 1014
Reputation:
Under the assumption that file names are unique and file locations don't change, it is possible to create a dictionary that allows searching for a file path in O(1) time. The dictionary creation will take some time, but you can pickle it to disk, so you only have to run it once.
A simple script to create the dictionary:
from pathlib import Path
import pickle
root = Path('path/to/root/folder')
# files extensions to index
extensions = {'.jpg', '.png'}
# iterating over whole `root` directory tree and indexing by file name
image = {file.stem: file for file in root.rglob('*.*') if file.suffix in extensions}
# saving the index on your computer for further use
index_path = Path('path/to/index.pickle')
with index_path.open('wb') as file:
    pickle.dump(image, file, pickle.HIGHEST_PROTOCOL)
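Note that the dict comprehension silently keeps only the last path seen for each stem, so duplicate file names would be dropped. A minimal sketch to verify the uniqueness assumption before trusting the index (this check is an addition, not part of the script above):

from collections import Counter
from pathlib import Path

root = Path('path/to/root/folder')
extensions = {'.jpg', '.png'}

# count how many files share each stem; any count > 1 violates the assumption
counts = Counter(f.stem for f in root.rglob('*.*') if f.suffix in extensions)
duplicates = [stem for stem, n in counts.items() if n > 1]
if duplicates:
    print(f'{len(duplicates)} duplicated names, e.g. {duplicates[:5]}')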
An example of loading the dictionary:
from pathlib import Path
import pickle
index_path = Path('path/to/index.pickle')
with index_path.open('rb') as file:
    image = pickle.load(file)
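With the index loaded, the search-and-copy step is a plain dictionary lookup per id. A minimal sketch, assuming ids holds the ~1,000-id subset and dst_folder is the destination (both names are placeholders, not from the code above):

import pickle
import shutil
from pathlib import Path

index_path = Path('path/to/index.pickle')
with index_path.open('rb') as file:
    image = pickle.load(file)

dst_folder = Path('path/to/output/folder')  # assumed destination folder
dst_folder.mkdir(parents=True, exist_ok=True)

ids = ['id_0001', 'id_0002']  # hypothetical subset of image ids
for img_id in ids:
    src = image.get(img_id)  # O(1) lookup in the index
    if src is not None:
        shutil.copy2(src, dst_folder / src.name)
    else:
        print(f'{img_id} not found in index')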
Upvotes: 1
Reputation: 7045
Here are a couple of scripts that should do what you're looking for.
index.py
This script uses pathlib to walk through directories searching for files with a given extension. It writes a TSV file with two columns, filename and filepath.
import argparse
from pathlib import Path

def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")
    ext = "*." + args.ext
    # walk the tree once, writing "filename<TAB>absolute path" per file
    with open(args.output, "w") as fh:
        for file in Path(args.input).rglob(ext):
            fh.write(f"{file.name}\t{file.resolve()}\n")

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument(
        "input",
        help="Top level folder which will be recursively "
        "searched for files ending with the value "
        "provided to `--ext`",
    )
    p.add_argument("output", help="Output file name for the index tsv file")
    p.add_argument(
        "--ext",
        default="jpg",
        help="Extension to search for. Don't include `*` or `.`",
    )
    main(p.parse_args())
search.py
This script loads the index (the output from index.py) into a dictionary, then loads the CSV file into a dictionary, and for each id_line it looks up the filename in the index and attempts to copy the file to the output folder.
import argparse
import csv
import shutil
from collections import defaultdict
from pathlib import Path

def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")
    if not Path(args.dest).is_dir():
        Path(args.dest).mkdir(parents=True)

    # load the "filename<TAB>path" index into a dictionary
    with open(args.index) as fh:
        index = dict(line.strip().split("\t", 1) for line in fh)
    print(f"Loaded {len(index):,} records")

    # load the CSV into a dictionary of column name -> list of values
    csv_dict = defaultdict(list)
    with open(args.csv) as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            for (k, v) in row.items():
                csv_dict[k].append(v)

    print(f"Searching for {len(csv_dict['id_line']):,} files")
    copied = 0
    for file in csv_dict["id_line"]:
        if file in index:
            shutil.copy2(index[file], args.dest)
            copied += 1
        else:
            print(f"!! File {file!r} not found in index")
    print(f"Copied {copied} files to {args.dest}")

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("index", help="Index file from `index.py`")
    p.add_argument("csv", help="CSV file with target filenames")
    p.add_argument("dest", help="Target folder to copy files to")
    main(p.parse_args())
python index.py --ext "jpg" "C:\path\to\image\folder" "index.tsv"
python search.py "index.tsv" "targets.csv" "C:\path\to\output\folder"
I would try this on one or two folders first to check that it produces the expected results.
Upvotes: 1