user1438495

Reputation: 1

Combine two lists of PDFs one to one using Python

I have created a series of PDF documents (maps) using data driven pages in ESRI ArcMap 10. Each map has a page 1 and a page 2, generated from separate *.mxd files. So I have one list of PDF documents containing page 1 for each map and one list of PDF documents containing page 2 for each map. For example: map1_001.pdf, map1_002.pdf, map1_003.pdf...map2_001.pdf, map2_002.pdf, map2_003.pdf...and so on.

I would like to append these maps, pages 1 and 2, together so that both page 1 and 2 are together in one PDF per map. For example: mapboth_001.pdf, mapboth_002.pdf, mapboth_003.pdf... (They don't have to go into a new PDF file (mapboth); it's fine to append them to map1.)

In other words: walk through the directory and, for each map1_*.pdf, append the map2_*.pdf whose number (the * part of the file name) matches.

There must be a way to do it using python. Maybe with a combination of arcpy, os.walk or os.listdir, and pyPdf and a for loop?

for pdf in os.walk(datadirectory):

      ??

Any ideas? Thanks kindly for your help.

Upvotes: 0

Views: 1368

Answers (4)

Patrick Maupin

Reputation: 11

There are examples of how to do this on the pdfrw project page at Google Code:

http://code.google.com/p/pdfrw/wiki/ExampleTools
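For reference, concatenating two files with pdfrw might look roughly like the sketch below. It is modeled on the project's "cat" style example, not copied from it. The helper names `partner_name` and `merge_pair` are made up here, and the map1_NNN.pdf / map2_NNN.pdf naming is taken from the question.

```python
import re

def partner_name(fname):
    """Return the map2_ file name matching a map1_ file name, or None."""
    m = re.match(r"map1_(\d+)\.pdf$", fname)
    return "map2_{}.pdf".format(m.group(1)) if m else None

def merge_pair(first, second, output):
    """Write the pages of `first` followed by the pages of `second` to `output`."""
    # imported here so the pairing helper above works without pdfrw installed
    from pdfrw import PdfReader, PdfWriter
    writer = PdfWriter()
    for fname in (first, second):
        writer.addpages(PdfReader(fname).pages)
    writer.write(output)
```

Something like `merge_pair("map1_001.pdf", partner_name("map1_001.pdf"), "mapboth_001.pdf")` would then produce one two-page file, assuming both inputs exist in the current directory.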

Upvotes: 0

Hugh Bothwell

Reputation: 56634

This should properly find and collate all the files to be merged; it still needs the actual .pdf-merging code.

Edit: I have added pdf-writing code based on the pyPdf example code. It is not tested, but should (as nearly as I can tell) work properly.

Edit2: realized I had the map-numbering crossways; rejigged it to merge the right sets of maps.

import collections
import glob
import re

# probably need to install this module -
#   pip install pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

def group_matched_files(filespec, reg, keyFn, dataFn):
    res = collections.defaultdict(list)
    reg = re.compile(reg)
    for fname in glob.glob(filespec):
        data = reg.match(fname)
        if data is not None:
            res[keyFn(data)].append(dataFn(data))
    return res

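def merge_pdfs(fnames, newname):
    # materialize first: a generator would be exhausted by the join below
    fnames = list(fnames)
    print("Merging {} to {}".format(",".join(fnames), newname))

    # create new output pdf
    newpdf = PdfFileWriter()

    # pyPdf resolves page contents lazily, so the input files
    # must stay open until the output has been written
    infiles = [open(fname, "rb") for fname in fnames]
    try:
        # for each file to merge
        for inf in infiles:
            oldpdf = PdfFileReader(inf)
            # for each page in the file
            for pg in range(oldpdf.getNumPages()):
                # copy it to the output file
                newpdf.addPage(oldpdf.getPage(pg))

        # write finished output
        with open(newname, "wb") as outf:
            newpdf.write(outf)
    finally:
        for inf in infiles:
            inf.close()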

def main():
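    matches = group_matched_files(
        "map*.pdf",
        r"map(\d+)_(\d+)\.pdf$",   # raw string; the dot before "pdf" is escaped
        lambda d: d.group(2),
        lambda d: "map{}_".format(d.group(1))
    )
    for num, prefixes in matches.items():
        # build a real list: a generator would be used up by the first pass over it
        fnames = [prefix + num + ".pdf" for prefix in sorted(prefixes)]
        merge_pdfs(fnames, "merged{}.pdf".format(num))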

if __name__=="__main__":
    main()
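To see what the grouping step produces without touching the disk, here is a small stand-alone sketch of the same idea run on a hard-coded list of names. `group_names` is a simplified, hypothetical variant of `group_matched_files` above (it takes a list instead of a glob pattern and stores whole file names), so treat it as an illustration only.

```python
import collections
import re

def group_names(names, pattern=r"map(\d+)_(\d+)\.pdf$"):
    """Group file names by their trailing number, e.g. '001'."""
    reg = re.compile(pattern)
    groups = collections.defaultdict(list)
    for name in names:
        m = reg.match(name)
        if m is not None:
            groups[m.group(2)].append(name)
    # sort each group so page 1 (map1_) comes before page 2 (map2_)
    return {num: sorted(files) for num, files in groups.items()}

names = ["map1_001.pdf", "map2_001.pdf", "map1_002.pdf", "map2_002.pdf", "notes.txt"]
groups = group_names(names)
print(groups["001"])  # ['map1_001.pdf', 'map2_001.pdf']
```

Each value is the list of files that would go into one merged PDF; names that do not match the pattern (such as notes.txt) are ignored.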

Upvotes: 1

Makoto

Reputation: 106410

A PDF file is structured differently from a plain text file. Simply concatenating two PDF files won't work, because the file's structure and contents could be overwritten or become corrupt. You could certainly write your own merger, but that would take a fair amount of time and an intimate knowledge of how a PDF is structured internally.

That said, I would recommend that you look into pyPDF. It supports the merging feature that you're looking for.
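pyPdf itself is no longer maintained; its successor is published as pypdf. A rough sketch with that package follows, assuming pypdf 3 or later (for `PdfWriter.append`) and the map1_/map2_ naming from the question. `find_pairs` and `merge_all` are hypothetical helper names, and the mapboth_ output name is just the one suggested in the question.

```python
import os
import re

def find_pairs(names):
    """Return (number, map1 name, map2 name) for every number that has both pages."""
    present = set(names)
    pairs = []
    for name in sorted(names):
        m = re.match(r"map1_(\d+)\.pdf$", name)
        if m:
            partner = "map2_{}.pdf".format(m.group(1))
            if partner in present:
                pairs.append((m.group(1), name, partner))
    return pairs

def merge_all(directory):
    """Write mapboth_NNN.pdf for every map1_/map2_ pair found in `directory`."""
    # imported lazily so find_pairs can be tried without pypdf installed
    from pypdf import PdfWriter
    for num, page1, page2 in find_pairs(os.listdir(directory)):
        writer = PdfWriter()
        writer.append(os.path.join(directory, page1))
        writer.append(os.path.join(directory, page2))
        writer.write(os.path.join(directory, "mapboth_{}.pdf".format(num)))
```

A map1_ file without a matching map2_ file is skipped here instead of raising an error; whether that is the right behavior depends on your data.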

Upvotes: 1

richardhsu

Reputation: 878

I don't have any test PDFs to combine, but I tested with a cat command on text files. You can try this out (I'm assuming a Unix-based system) as merge.py:

import os, re

mapdir = "/home/user/directory_with_maps/"
# run from the map directory so the relative file names in the gs command resolve
os.chdir(mapdir)
for current in os.listdir("."):
    # only map1_NNN.pdf files; the captured number selects the matching map2_ file
    search = re.match(r"map1_(\d+)\.pdf$", current)
    if search:
        name = search.group(1)
        cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=FULLMAP_%s.pdf %s map2_%s.pdf" % (name, current, name)
        os.system(cmd)

Basically, it collects the map1 files, takes the number from each file name, and assumes that a map2 file with the same number exists. (You could also use a counter padded with zeros to get a similar effect.)

Test the gs command first, though; I just grabbed it from http://hints.macworld.com/article.php?story=2003083122212228.
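If you would rather not build a shell string, the same Ghostscript call can be passed to subprocess as an argument list, which avoids quoting problems with odd file names. This is only a sketch: `gs_command` and `merge_with_gs` are made-up helper names, the flags are the ones from the command above, and it assumes `gs` is on the PATH.

```python
import os
import subprocess

def gs_command(number, directory="."):
    """Build the Ghostscript argument list for one map number, e.g. '001'."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
        "-sOutputFile=" + os.path.join(directory, "FULLMAP_{}.pdf".format(number)),
        os.path.join(directory, "map1_{}.pdf".format(number)),
        os.path.join(directory, "map2_{}.pdf".format(number)),
    ]

def merge_with_gs(number, directory="."):
    """Run Ghostscript for one map number."""
    # check_call raises CalledProcessError if gs exits with a non-zero status
    subprocess.check_call(gs_command(number, directory))
```

Printing `gs_command("001")` before running anything is a cheap way to check the command line.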

Upvotes: 0
