Michał Darowny

Reputation: 414

Page number in PyMuPDF multiprocessing with extract_text

The PyMuPDF documentation states that PyMuPDF does not support running on multiple threads.

So they use multiprocessing instead, and the example code does this odd thing with page segments:

    seg_size = int(num_pages / cpu + 1)
    seg_from = idx * seg_size
    seg_to = min(seg_from + seg_size, num_pages)
    for i in range(seg_from, seg_to):  # work through our page segment
        page = doc[i]
        # page.get_text("rawdict")  # use any page-related type of work here, eg
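
To make the segment arithmetic above easier to follow, here is a minimal sketch that isolates it in a function. The name `segment_bounds` is hypothetical (it is not part of PyMuPDF); `idx` is assumed to be the worker index and `cpu` the number of worker processes, as in the snippet.

```python
from typing import Tuple


def segment_bounds(idx: int, cpu: int, num_pages: int) -> Tuple[int, int]:
    """Return the half-open page range [seg_from, seg_to) for worker idx.

    Hypothetical helper: same arithmetic as the example snippet.
    """
    seg_size = int(num_pages / cpu + 1)  # pages per worker, rounded up generously
    seg_from = idx * seg_size  # first page of this worker's segment
    seg_to = min(seg_from + seg_size, num_pages)  # clamp to the last page
    return seg_from, seg_to
```

With 10 pages and 4 workers this gives (0, 3), (3, 6), (6, 9) and (9, 10), so each worker handles one contiguous run of pages. If `seg_from` ends up past `num_pages`, then `range(seg_from, seg_to)` is simply empty.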

Why not load the document first, get the number of pages, and then pass each page number to a handler function, instead of using segments as in the example code? Would this cause issues?

    from typing import Tuple

    import pymupdf

    def extract_text_from_page(args: Tuple[bytes, int]) -> Tuple[int, str]:
        pdf_stream, page_num = args
        # Open a new Document instance in this process
        doc = pymupdf.open(stream=pdf_stream)
        page = doc.load_page(page_num)  # Load the specific page
        text = page.get_text(sort=True)  # Extract text with sorting
        doc.close()  # Clean up
        return (page_num, text)
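
For context, this is roughly how I would drive that handler: a sketch, not tested on large files. `make_page_tasks` and `extract_all` are hypothetical names of mine, and the default of 4 processes is an arbitrary assumption. `pymupdf` is imported inside the functions only so that the task-building helper can be used without PyMuPDF installed.

```python
from multiprocessing import Pool
from typing import List, Tuple


def make_page_tasks(pdf_stream: bytes, num_pages: int) -> List[Tuple[bytes, int]]:
    """Hypothetical helper: build one (pdf bytes, page number) task per page."""
    return [(pdf_stream, page_num) for page_num in range(num_pages)]


def extract_text_from_page(args: Tuple[bytes, int]) -> Tuple[int, str]:
    import pymupdf  # imported lazily; see the note above the code

    pdf_stream, page_num = args
    # Each task opens its own Document instance inside the worker process
    doc = pymupdf.open(stream=pdf_stream, filetype="pdf")
    try:
        text = doc.load_page(page_num).get_text(sort=True)
    finally:
        doc.close()
    return page_num, text


def extract_all(pdf_path: str, processes: int = 4) -> List[str]:
    """Read the PDF once in the parent, then extract every page in a pool."""
    import pymupdf

    with open(pdf_path, "rb") as f:
        pdf_stream = f.read()

    # Open once in the parent only to learn the page count
    doc = pymupdf.open(stream=pdf_stream, filetype="pdf")
    num_pages = doc.page_count
    doc.close()

    with Pool(processes) as pool:
        # Pool.map returns results in the same order as the tasks
        results = pool.map(extract_text_from_page, make_page_tasks(pdf_stream, num_pages))
    return [text for _, text in results]
```

One thing I notice in this sketch is that every task carries the full PDF bytes, which get pickled and sent to a worker, and the document is reopened once per page rather than once per process as in the segment version.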

Upvotes: 0

Views: 45

Answers (0)
