Reputation: 136595
I know that
pdftotext -bbox foobar.pdf
creates an HTML file which contains content like
<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>
Hence each single word has a bounding box.
The Python package PDFminer in contrast seems only to be able to give the position of a block of text (see example).
How can I get the bounding boxes for each word in Python?
Upvotes: 7
Views: 1906
Reputation: 9032
Disclaimer: I am the author of borb, the package used in this answer.
You will need to do some processing to get word-level bounding boxes. The problem is that a PDF (in the worst case) contains only rendering instructions, and no structural information.
Put simply, your PDF might contain (in pseudo-code):
The problem is that instruction 3 might contain anything from
In order to retrieve the bounding boxes of words, you'll need to do some processing (as mentioned before). You will need to render those instructions and split the text (preferably as it is being rendered) into words.
Then it's a matter of keeping track of the coordinates of the turtle, and you're set to go.
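To make the idea concrete, here is a toy sketch of that bookkeeping. It is not borb's implementation and it does not parse a real PDF: the instruction tuples (`"move"`, `"font_size"`, `"show"`), the `WordBox` class and the `word_boxes` function are all hypothetical names invented for this illustration, and every glyph is assumed to be `0.5 * font_size` wide. A real renderer would use the font's glyph metrics, the text matrix, kerning and character spacing instead.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def word_boxes(instructions, char_width_factor=0.5):
    """Replay simplified rendering instructions and return one box per word.

    ASSUMPTION: every glyph advances the cursor by char_width_factor * font_size.
    """
    x, y = 0.0, 0.0      # current position of the "turtle"
    font_size = 12.0
    boxes = []
    current = None       # the word currently being assembled, if any

    def flush():
        # close the word in progress (if there is one) and store its box
        nonlocal current
        if current is not None:
            boxes.append(current)
            current = None

    for op, *args in instructions:
        if op == "move":
            # a jump to a new position always ends the current word
            flush()
            x, y = float(args[0]), float(args[1])
        elif op == "font_size":
            font_size = float(args[0])
        elif op == "show":
            advance = char_width_factor * font_size
            for ch in args[0]:
                if ch == " ":
                    flush()
                elif current is None:
                    current = WordBox(ch, x, y, x + advance, y + font_size)
                else:
                    # extend the word, even across separate "show" instructions
                    current.text += ch
                    current.x_max = x + advance
                    current.y_max = max(current.y_max, y + font_size)
                x += advance
    flush()
    return boxes


# the word "Hello" is deliberately split over two "show" instructions
demo = [
    ("move", 90, 700),
    ("font_size", 12),
    ("show", "Hel"),
    ("show", "lo World"),
]
for box in word_boxes(demo):
    print(box)
```

Because the word in progress is only closed on a space or a move, a word that is spread over several rendering instructions still ends up in a single box.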
borb does this (under the hood) for you.
from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction
# this line builds a RegularExpressionTextExtraction
# this class listens to rendering instructions
# and performs the logic I mentioned in the text part of this answer
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[^ ]+")
# now we can load the file and perform our processing
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [l])
# now we just need to get the boxes out of it
# RegularExpressionTextExtraction returns a list of type PDFMatch
# this class can return a list of bounding boxes (should your
# regular expression ever need to be matched over separate lines of text)
for m in l.get_matches_for_page(0):
    # here we just print the Rectangle
    # but feel free to do something useful with it
    print(m.get_bounding_boxes()[0])
borb is an open source, pure Python PDF library that creates, modifies and reads PDF documents. You can install it using:
pip install borb
Alternatively, you can build from source by forking/downloading the GitHub repository.
Upvotes: 1