Page number in PyMuPDF multiprocessing with extract_text

Question

The PyMuPDF documentation states that PyMuPDF does not support running on multiple threads.

So the documentation uses multiprocessing instead, and its example code splits the pages into segments in a way I don't understand:

    seg_size = int(num_pages / cpu + 1)
    seg_from = idx * seg_size
    seg_to = min(seg_from + seg_size, num_pages)
    for i in range(seg_from, seg_to):  # work through our page segment
        page = doc[i]
        # page.get_text("rawdict")  # use any page-related type of work here, eg
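
If I read the example correctly, `idx` is the 0-based worker number and `cpu` is the number of worker processes, so each worker ends up with one contiguous block of pages. A minimal sketch of just that arithmetic, with no PyMuPDF involved (the function name `segment_bounds` is my own, not something from the documentation):

```python
def segment_bounds(idx: int, cpu: int, num_pages: int) -> range:
    """Return the page indices handled by worker number idx (0-based)."""
    seg_size = int(num_pages / cpu + 1)  # same formula as in the example code
    seg_from = idx * seg_size
    seg_to = min(seg_from + seg_size, num_pages)  # clamp the last segment
    return range(seg_from, seg_to)


if __name__ == "__main__":
    num_pages, cpu = 10, 4
    for idx in range(cpu):
        # Show which pages each worker would process
        print(idx, list(segment_bounds(idx, cpu, num_pages)))
```

With 10 pages and 4 workers, the segments come out as pages 0-2, 3-5, 6-8 and 9, so every page is covered exactly once and the last worker simply gets a shorter block.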

Why not load the document first, get the number of pages, and then pass each page number to a handler function, instead of using segments as in the example code? Would this cause issues?

    from typing import Tuple

    import pymupdf


    def extract_text_from_page(args: Tuple[bytes, int]) -> Tuple[int, str]:
        pdf_stream, page_num = args
        # Open a new Document instance in this process
        doc = pymupdf.open(stream=pdf_stream)
        page = doc.load_page(page_num)  # Load the specific page
        text = page.get_text(sort=True)  # Extract text with sorting
        doc.close()  # Clean up
        return (page_num, text)
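
To make the per-page variant concrete, here is a minimal sketch of how I would build the arguments for `extract_text_from_page`. The helpers `build_page_tasks` and `pickled_task_sizes` are names I made up for illustration, and the byte string is only a stand-in, not a real PDF. My understanding is that multiprocessing pickles task arguments before sending them to a worker, so each task on its own carries a full copy of the stream. How much is actually transferred in total would depend on how the pool batches tasks.

```python
import pickle
from typing import List, Tuple


def build_page_tasks(pdf_stream: bytes, num_pages: int) -> List[Tuple[bytes, int]]:
    """Build one (pdf_stream, page_num) task per page (hypothetical helper)."""
    return [(pdf_stream, page_num) for page_num in range(num_pages)]


def pickled_task_sizes(tasks: List[Tuple[bytes, int]]) -> List[int]:
    """Size in bytes of each task when pickled on its own (hypothetical helper)."""
    return [len(pickle.dumps(task)) for task in tasks]


if __name__ == "__main__":
    fake_stream = b"x" * 1_000_000  # stand-in for real PDF bytes, not a valid PDF
    tasks = build_page_tasks(fake_stream, num_pages=5)
    # Each individually pickled task is at least as large as the stream itself
    print(pickled_task_sizes(tasks))
```

The resulting list could then be handed to something like `multiprocessing.Pool.map(extract_text_from_page, tasks)`, which is the part I am unsure about.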
