The PyMuPDF documentation states that PyMuPDF does not support running on multiple threads, so its examples use multiprocessing instead. The example code splits the pages into segments, which looks odd to me:
    seg_size = int(num_pages / cpu + 1)
    seg_from = idx * seg_size
    seg_to = min(seg_from + seg_size, num_pages)

    for i in range(seg_from, seg_to):  # work through our page segment
        page = doc[i]
        # page.get_text("rawdict")  # use any page-related type of work here, eg
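To make the segment arithmetic concrete, here is a minimal sketch that computes only the page ranges, without PyMuPDF. `segment_bounds` is a hypothetical helper name I made up, and 10 pages on 4 CPUs is just an illustrative input.

```python
from typing import List, Tuple


def segment_bounds(num_pages: int, cpu: int) -> List[Tuple[int, int]]:
    """Return the (seg_from, seg_to) range each worker index would process."""
    seg_size = int(num_pages / cpu + 1)  # same formula as the example code
    bounds = []
    for idx in range(cpu):
        seg_from = idx * seg_size
        seg_to = min(seg_from + seg_size, num_pages)  # clamp the last segment
        bounds.append((seg_from, seg_to))
    return bounds


print(segment_bounds(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

So each process gets one contiguous block of pages, and the last block is simply shorter. When `seg_from` ends up past `seg_to`, `range(seg_from, seg_to)` is empty and that worker does nothing.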
Why not open the document first, get the number of pages, and then pass each page number to a handler function, instead of using segments as in the example code? Would this cause issues?
    from typing import Tuple

    import pymupdf

    def extract_text_from_page(args: Tuple[bytes, int]) -> Tuple[int, str]:
        pdf_stream, page_num = args
        # Open a new Document instance in this process
        doc = pymupdf.open(stream=pdf_stream)
        page = doc.load_page(page_num)  # Load the specific page
        text = page.get_text(sort=True)  # Extract text with sorting
        doc.close()  # Clean up
        return (page_num, text)
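One thing I could check myself is how much data each approach sends to the worker processes, since multiprocessing pickles task arguments. The sketch below only measures pickled sizes with the standard library. Both function names are hypothetical, and it assumes every task is pickled on its own and that a segment worker receives only an index, the CPU count, and a filename. A pool that batches tasks may share the repeated bytes object within a batch, so the real per-page overhead could be lower.

```python
import pickle


def per_page_payload_bytes(pdf_stream: bytes, num_pages: int) -> int:
    """Bytes pickled if each (pdf_stream, page_num) task is sent separately."""
    return sum(len(pickle.dumps((pdf_stream, n))) for n in range(num_pages))


def per_segment_payload_bytes(filename: str, cpu: int) -> int:
    """Bytes pickled if each worker only gets (idx, cpu, filename)."""
    return sum(len(pickle.dumps((idx, cpu, filename))) for idx in range(cpu))


fake_pdf = b"x" * 1_000_000  # stand-in bytes, not a real PDF
print(per_page_payload_bytes(fake_pdf, 50))
print(per_segment_payload_bytes("input.pdf", 4))
```

Separately from the transfer cost, the per-page version also opens and closes the document once per page, whereas a segment worker opens it once for its whole range.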