Reputation: 1177
I know there are many libraries for extracting text from PDFs. Specifically, I've been having some difficulty with PyMuPDF.
From the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes
I was hoping to use select() to pick an interval of pages, and then use getText() on the result.
This is the document I am using: linear_regression.pdf
import fitz
s = [1, 2]
doc = fitz.open('linear_regression.pdf')
selection = doc.select(s)
text = selection.getText(s)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
6 # print(selection)
7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
9 text
AttributeError: 'NoneType' object has no attribute 'getText'
So I'm assuming select() is not being used correctly.
Thanks so much.
Upvotes: 1
Views: 2006
Reputation: 5560
select() here, according to the documentation, modifies doc internally and does not return anything. In Python, a function that does not explicitly return a value returns None, so selection is None and calling getText on it raises the AttributeError you see.
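If you do want to use select(), a minimal sketch of the in-place pattern is below. The helper name text_of_selected_pages is hypothetical (it is not part of PyMuPDF), and the sketch assumes a PyMuPDF version where pages expose get_text(); older releases spelled this getText(). Note that select() changes the open document object itself, so the other pages are no longer available on doc afterwards.

```python
def text_of_selected_pages(doc, page_numbers):
    """Keep only the given 0-indexed pages in `doc`, then return their text.

    Hypothetical helper: `doc` is assumed to be an open PyMuPDF Document.
    """
    # select() modifies doc in place and returns None,
    # so keep working with doc rather than with the return value.
    doc.select(page_numbers)
    # Iterating a Document yields its (remaining) pages in order.
    return [page.get_text() for page in doc]


# Usage (assumes PyMuPDF is installed and the file exists):
# import fitz
# doc = fitz.open("linear_regression.pdf")
# texts = text_of_selected_pages(doc, [1, 2])
```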
However, Document provides a method called get_page_text, which returns the text of a specific page (0-indexed). So for your example, you could write:
import fitz
s = [1, 2] # pages 2 and 3
doc = fitz.open('linear_regression.pdf')
text_by_page = [doc.get_page_text(i) for i in s]
Now, you have a list, where each item in the list is the text from a different desired page. A simple way to convert this to a string is:
text = ' '.join(text_by_page)
which joins the pages with a single space between the end of one page and the start of the next (as if there were no page break at all).
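The two steps above can be wrapped in one small function. This is only a sketch: extract_pages and its sep parameter are hypothetical names, and doc is assumed to be an open PyMuPDF Document (or anything else with a get_page_text(pno) method). The separator defaults to a newline here so page boundaries stay visible; pass sep=' ' to get the same result as the join above.

```python
def extract_pages(doc, page_numbers, sep="\n"):
    """Return the text of the given 0-indexed pages, joined by `sep`.

    Hypothetical helper: relies only on doc.get_page_text(pno).
    Unlike select(), this leaves the document itself unchanged.
    """
    return sep.join(doc.get_page_text(pno) for pno in page_numbers)


# Usage (assumes PyMuPDF is installed and the file exists):
# import fitz
# doc = fitz.open("linear_regression.pdf")
# text = extract_pages(doc, [1, 2])           # pages 2 and 3
# text = extract_pages(doc, [1, 2], sep=" ")  # same as ' '.join(...)
```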
Upvotes: 3