Using os.walk to create a filelist for each directory

Question

I am trying to use os.walk to build a list of files for each subdirectory and then run a function that merges all the PDFs in that list. The current script carries files over from one directory to the next: the PDFs in directory1 are merged successfully, but the list for directory2 also includes the PDFs from directory1, and so on. I want the list of files to be reset for each directory. Here is the script I am currently using:

    import PyPDF2
    import os
    import sys

    if len(sys.argv) > 1:
        SearchDirectory = sys.argv[1]
        print("I'm looking for PDF's in ", SearchDirectory)
    else:
        print("Please tell me the directory to look in")
        sys.exit()

    pdfWriter = PyPDF2.PdfFileWriter()


    for root, dirs, files in os.walk(SearchDirectory):
        dirs.sort()
        for file in files:
            files.sort()
            pdfFiles = []
            if file.endswith('.pdf') and ((os.path.basename(root)) == "frames"):
                print("Discovered this pdf: ", os.path.join(root, file))
                pdfFiles.append(os.path.join(root, file))

            if pdfFiles:
                for file in pdfFiles:
                    pdfFileObj = open(file, 'rb')
                    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

                    for pageNum in range(0, pdfReader.numPages):
                        pageObj = pdfReader.getPage(pageNum)
                        pdfWriter.addPage(pageObj)
                        pdfOutput = open((os.path.split(os.path.realpath(root))[0]) + ".pdf", "wb")
                        pdfWriter.write(pdfOutput)
                        pdfOutput.close()

                print("The following pdf has been successfully appended:", os.path.join(root, file))
            else:
                print("No pdfs found in this directory:", root)

Tomalak · Accepted Answer

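The os.walk loop iterates once per directory, so you want to create a new PdfFileWriter for every directory. Your script creates a single pdfWriter before the loop, which is why the pages from earlier directories are still in it when the next directory is written.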

It's also a good idea to use continue to bail out of the current iteration as early as possible; this keeps the nesting flat.

By convention, names that start with a capital letter are used for classes, so the variable should be searchDirectory, written with a lowercase s. (PEP 8 would go one step further and suggest search_directory.)

Finally, take advantage of with blocks for handling I/O - they automatically call .close() for you.
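
As a quick illustration of that last point, here is a minimal sketch that uses only the standard library. The file name demo.bin and the variable names are made up for the example; the point is only that the file object is already closed once the with block ends.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")

with open(path, "wb") as handle:
    handle.write(b"%PDF-1.4")
    closed_inside = handle.closed   # still open inside the block

closed_after = handle.closed        # the with block has closed it

print(closed_inside, closed_after)
```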

I'm not going to install PyPDF2 just for this question, but this approach looks reasonable:

    for root, dirs, files in os.walk(searchDirectory):
        # only directories named "frames" are of interest
        if os.path.basename(root) != "frames":
            continue

        # build a fresh list for this directory, keeping only the PDFs
        pdfFiles = [os.path.join(root, file)
                    for file in sorted(files) if file.endswith('.pdf')]

        if not pdfFiles:
            continue

        # new writer per directory, so nothing carries over from the last one
        pdfWriter = PyPDF2.PdfFileWriter()
        outputFile = os.path.split(os.path.realpath(root))[0] + ".pdf"

        for file in pdfFiles:
            print("Discovered this pdf:", file)
            with open(file, 'rb') as pdfInput:
                pdfReader = PyPDF2.PdfFileReader(pdfInput)

                for page in pdfReader.pages:
                    pdfWriter.addPage(page)

        # write the merged output once, after all pages have been added
        with open(outputFile, "wb") as pdfOutput:
            pdfWriter.write(pdfOutput)

        print("%s files appended to %s" % (len(pdfFiles), outputFile))
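
Since the code above is untested, the directory-grouping part can at least be checked on its own without PyPDF2. The following is only a sketch under stated assumptions: the function name pdfs_per_frames_dir and the throwaway directory tree are invented for the example, and the "PDFs" are empty placeholder files.

```python
import os
import tempfile


def pdfs_per_frames_dir(search_directory):
    """Map each "frames" directory to a sorted list of the PDFs inside it.

    Hypothetical helper for illustration; it is not part of PyPDF2 or os.
    """
    result = {}
    for root, dirs, files in os.walk(search_directory):
        dirs.sort()  # in-place sort makes os.walk descend in a stable order
        if os.path.basename(root) != "frames":
            continue
        # A new list is built on every iteration, so nothing carries over.
        pdf_files = [os.path.join(root, name)
                     for name in sorted(files)
                     if name.lower().endswith(".pdf")]
        if not pdf_files:
            continue
        result[root] = pdf_files
    return result


# Build a throwaway tree: two "frames" directories plus one that is ignored.
base = tempfile.mkdtemp()
for rel in ("a/frames/2.pdf", "a/frames/1.pdf", "a/frames/notes.txt",
            "b/frames/3.pdf", "c/other/4.pdf"):
    full = os.path.join(base, *rel.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "wb").close()  # empty placeholder file

groups = pdfs_per_frames_dir(base)
for directory, pdf_files in groups.items():
    print(directory, [os.path.basename(p) for p in pdf_files])
```

Each "frames" directory gets its own list, which is the behaviour the question asks for; the merge step would then create one writer per entry in that mapping.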
