Saverio Vasapollo

Reputation: 53

PyMuPDF Bookmarks

I have a script that combines a bunch of PDFs into a single file using PyPDF2. It works, but on the company network it is really slow. I then tried PyMuPDF, which is about 100 times faster, but bookmarks and metadata are not copied automatically. Is there an argument I can pass to say "while you are copying, also don't forget the bookmarks and metadata, buddy"?

A bit of code here:

import time

import fitz  # PyMuPDF

# sorted_list, job_number, date and date2 are defined elsewhere in the script.
def pdfMerge(try_again):
    start = time.time()
    result = fitz.open()  # new, empty PDF
    for pdf in sorted_list:
        print(pdf)
        with fitz.open(pdf) as file_temp:
            result.insert_pdf(file_temp)  # append all pages of this file
    if try_again == 0:
        formatted_name = f"{job_number}-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-Combined Set-{date2}.pdf"
    result.save(formatted_name)
    end = time.time()
    print(end - start)  # elapsed seconds
    return formatted_name

I am also open to other options such as pikepdf (which seems better supported).

Thanks!

EDIT: I changed the code:

def pdfMerge(try_again):
    start = time.time()
    toc = []
    result = fitz.open()
    for pdf in sorted_list:
        print(pdf)
        with fitz.open(pdf) as file_temp:
            bookmarks = file_temp.get_toc()
            file_temp.set_toc(bookmarks)
            result.insert_pdf(file_temp)
            print(bookmarks)
            bookmarks = ''
    if try_again == 0:
        formatted_name = f"{job_number}-RGB-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-RGB-Combined Set-{date2}.pdf"
    result.save(formatted_name)
    end = time.time()
    print(end - start)
    return formatted_name

The print(bookmarks) output shows exactly what I need, but the combined PDF still has no bookmarks. What am I doing wrong?

EDIT 2: Here is my new function:

def pdfMerge(try_again):
    start = time.time()
    toc = []
    result = fitz.open()
    bookmarks_list = []
    for pdf in sorted_list:
        with fitz.open(pdf) as file_temp:
            bookmarks = file_temp.get_toc()
            print(bookmarks)
            bookmarks_list.append(bookmarks)
            result.insert_pdf(file_temp)
    if try_again == 0:
        formatted_name = f"{job_number}-RGB-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-RGB-Combined Set-{date2}.pdf"
    print(bookmarks_list)
    result.set_toc(bookmarks_list)
    result.save(formatted_name)
    end = time.time()
    print(end - start)
    return formatted_name

Which gives me this error:

  File "C:\Users\Sav...\Coding_Python\PdfMerge\RBGPdfMerge.0.11.10.py", line 112, in <module>
    pdfMerge(try_again)
  File "C:\Users\Sav...\Coding_Python\PdfMerge\RBGPdfMerge.0.11.10.py", line 88, in pdfMerge
    result.set_toc(bookmarks_list)
  File "C:\Users\Sav...\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\fitz\utils.py", line 1325, in set_toc
    raise ValueError("hierarchy level of item 0 must be 1")
ValueError: hierarchy level of item 0 must be 1

The same files merge perfectly with pypdf and PyPDF2.

Upvotes: 1

Views: 1978

Answers (1)

Jorj McKie

Reputation: 3110

As for the metadata:

It remains unchanged: the result keeps the metadata of the PDF into which you are merging pages from other files.

PyMuPDF lets you treat bookmarks as a table of contents (TOC), much like the one in an ordinary book: the bookmark items simply follow each other, and each has a level, a title and a page, plus optionally some detail on exactly where on the target page it points.

So when you append a PDF to another one, you can simply append its TOC to the TOC of the target PDF as well. All you must do is increase its page numbers by the number of pages the target had before the append.
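To make the page-number shift concrete, here is a minimal sketch in plain Python that does not need PyMuPDF. It assumes each TOC entry is a list of the form [level, title, page] with 1-based page numbers, as described above. The name shift_toc and the sample data are hypothetical and only serve as illustration.

```python
def shift_toc(toc, offset):
    """Return a copy of toc with every page number increased by offset."""
    # Each entry is rebuilt, so the source TOC is left untouched; any
    # extra items after the page number are carried over unchanged.
    return [
        [level, title, page + offset, *rest]
        for level, title, page, *rest in toc
    ]


# Hypothetical data: the target already has 3 pages, so page 1 of the
# appended file becomes page 4 of the merged file.
target_toc = [[1, "Cover", 1], [2, "Notes", 2]]
appended_toc = [[1, "Plans", 1], [2, "Level 1", 2]]

combined = target_toc + shift_toc(appended_toc, 3)
print(combined)
```

The levels and titles are not touched; only the third item of each appended entry changes.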

When you are done appending files, set the resulting TOC (a simple Python list) as the table of contents of the resulting file.
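Because the resulting TOC has to be one flat list of entries, it matters whether the per-file TOCs are collected with append or with extend. The sketch below uses made-up data and a hypothetical helper, first_level, to show the difference in shape. It is one plausible reading of the ValueError in the question: bookmarks_list.append(bookmarks) builds a list of whole TOCs, so item 0 is itself a list of entries rather than an entry whose level is 1.

```python
# Two hypothetical single-entry TOCs, one per input file.
toc_a = [[1, "Cover", 1]]
toc_b = [[1, "Plans", 1]]

nested = []  # what append() builds: a list of TOCs
flat = []    # what extend() builds: one list of entries

for toc in (toc_a, toc_b):
    nested.append(toc)
    flat.extend(toc)


def first_level(toc):
    """Return what a level check on item 0 would look at."""
    return toc[0][0]


print(first_level(nested))  # [1, 'Cover', 1]  (a whole entry, not a level)
print(first_level(flat))    # 1
```

Note that flat still carries each file's own page numbers; they need to be shifted per file before the list is set as the TOC.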

Here is an example taken directly from the PyMuPDF documentation:

>>> doc1 = fitz.open("file1.pdf")
>>> doc2 = fitz.open("file2.pdf")

>>> pages1 = len(doc1)  # save doc1's page count
>>> toc1 = doc1.get_toc(False)  # save TOC 1
>>> toc2 = doc2.get_toc(False)  # save TOC 2
>>> doc1.insert_pdf(doc2)  # doc2 at end of doc1
>>> for t in toc2:  # increase toc2 page numbers
...     t[2] += pages1  # by old len(doc1)
>>> doc1.set_toc(toc1 + toc2)  # now result has total TOC
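The same pattern can be extended to a loop over many files, which is closer to the function in the question. The following is only a sketch under some assumptions: merge_pdfs_with_toc, paths and out_path are hypothetical names, and reusing the first file's metadata for the result is a guess at what is wanted, not something PyMuPDF does by itself.

```python
def merge_pdfs_with_toc(paths, out_path):
    """Append the PDFs in paths, in order, and rebuild one combined TOC."""
    # Imported inside the function so this sketch can be loaded on a
    # machine without PyMuPDF; calling the function does require it.
    import fitz

    result = fitz.open()  # new, empty PDF
    combined_toc = []     # one flat list of TOC entries
    first_metadata = None

    for path in paths:
        with fitz.open(path) as src:
            offset = len(result)      # pages already in the result
            toc = src.get_toc(False)  # full entries, as in the example above
            for entry in toc:
                entry[2] += offset    # shift to the page's position in the result
            combined_toc.extend(toc)  # extend, not append: keep the list flat
            if first_metadata is None:
                first_metadata = src.metadata
            result.insert_pdf(src)    # append all pages of this file

    result.set_toc(combined_toc)
    if first_metadata:
        # Assumption: the merged file should carry the first file's metadata.
        result.set_metadata(first_metadata)
    result.save(out_path)
    result.close()
    return out_path
```

In the question's function this would replace the loop and the set_toc call, for example merge_pdfs_with_toc(sorted_list, formatted_name).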

Upvotes: 3
