Reputation: 136267
Given a digitally created PDF file, I would like to extract the text with the coordinates. A bounding box would be awesome, but an anchor + font / font size would work as well.
I've created an example PDF document so that it's easy to try things out / share the result.
pdftotext PDF-export-example.pdf -layout
gives this output. It already contains the text, but the coordinates are not in there.
PyPDF2 is even worse: it gives neither coordinates nor font size, and in this case not even ASCII-art clues about the layout:
from PyPDF2 import PdfFileReader

def text_extractor(path):
    # Print the plain text of the first page (legacy PyPDF2 API names)
    with open(path, "rb") as f:
        pdf = PdfFileReader(f)
        page = pdf.getPage(0)
        text = page.extractText()
    print(text)

if __name__ == "__main__":
    path = "PDF-export-example.pdf"
    text_extractor(path)
Another method to extract text, but without coordinates / font size. Thank you to Duck puncher for this one:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = "utf-8"
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(
            fp,
            pagenos,
            maxpages=maxpages,
            password=password,
            caching=caching,
            check_extractable=True,
        ):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

if __name__ == "__main__":
    print(convert_pdf_to_txt("PDF-export-example.pdf"))
This one goes a bit more in the right direction as it can give the font name and size. But the coordinates are still missing (and the output is a bit verbose as it is character-by-character):
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

for page_layout in extract_pages("PDF-export-example.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        # One output line per character: glyph, font name, font size
                        print(character.get_text(), character.fontname, character.size)
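To make the character-by-character output less verbose, consecutive characters that share a font name and size could be merged into runs. This is only a sketch using the standard library: the (character, font name, font size) records and the font names below are hypothetical stand-ins for what the loop above would print, and it assumes the sizes compare exactly equal (real values may need rounding first).

```python
from itertools import groupby

# Hypothetical per-character records: (character, font name, font size)
chars = [
    ("T", "Calibri-Light", 28.0),
    ("e", "Calibri-Light", 28.0),
    ("s", "Calibri-Light", 28.0),
    ("t", "Calibri-Light", 28.0),
    ("L", "Calibri", 11.0),
    ("o", "Calibri", 11.0),
    ("r", "Calibri", 11.0),
    ("e", "Calibri", 11.0),
    ("m", "Calibri", 11.0),
]


def merge_runs(chars):
    """Merge consecutive characters that share font name and size."""
    runs = []
    for (fontname, size), group in groupby(chars, key=lambda c: (c[1], c[2])):
        text = "".join(ch for ch, _, _ in group)
        runs.append((text, fontname, size))
    return runs


runs = merge_runs(chars)
for text, fontname, size in runs:
    print(text, fontname, size)
```

This still carries no coordinates; it only condenses the font information.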
Here I don't get anything at all:
from tabula import read_pdf
df = read_pdf("PDF-export-example.pdf")
Upvotes: 6
Views: 5506
Reputation: 6234
I've used PyMuPDF to extract page content as a list of single words with bbox information.
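import fitz

doc = fitz.open("PDF-export-example.pdf")
for page in doc:
    # getTextWords() is the older camelCase name; newer PyMuPDF releases use page.get_text("words")
    wlist = page.getTextWords()  # make the word list
    print(wlist)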
(72.0250015258789, 72.119873046875, 114.96617889404297, 106.299560546875, 'Test', 0, 0, 0),
(120.26901245117188, 72.119873046875, 231.72618103027344, 106.299560546875, 'document', 0, 0, 1),
(72.0250015258789, 106.21942138671875, 101.52294921875, 120.18524169921875, 'Lorem', 1, 0, 0),
(103.98699951171875, 106.21942138671875, 132.00445556640625, 120.18524169921875, 'ipsum', 1, 0, 1),
(134.45799255371094, 106.21942138671875, 159.06637573242188, 120.18524169921875, 'dolor', 1, 0, 2),
(161.40098571777344, 106.21942138671875, 171.95208740234375, 120.18524169921875, 'sit', 1, 0, 3),
The getTextWords() method separates a page’s text into “words” using spaces and line breaks as delimiters. Therefore the words in this list contain no spaces or line breaks.
Return type: list
An item of this list looks like this:
(x0, y0, x1, y1, "word", block_no, line_no, word_no)
Where the first 4 items are the float coordinates of the word’s bbox. The last three integers provide some more information on the word’s whereabouts.
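If whole lines are more useful than single words, the block_no and line_no fields can be used to group the tuples. The following is a rough sketch, not part of PyMuPDF: Word and group_lines are names I made up for illustration, and the sample list just repeats a few of the tuples shown above so it runs without a PDF.

```python
from typing import NamedTuple


class Word(NamedTuple):
    # Field order follows the 8-item tuple described above
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
    block_no: int
    line_no: int
    word_no: int


def group_lines(wlist):
    """Map (block_no, line_no) to (union bbox, joined text) for each line."""
    grouped = {}
    for w in map(Word._make, wlist):
        grouped.setdefault((w.block_no, w.line_no), []).append(w)
    lines = {}
    for key, words in grouped.items():
        words.sort(key=lambda w: w.word_no)  # keep reading order within the line
        bbox = (
            min(w.x0 for w in words),
            min(w.y0 for w in words),
            max(w.x1 for w in words),
            max(w.y1 for w in words),
        )
        lines[key] = (bbox, " ".join(w.text for w in words))
    return lines


# A few of the word tuples from the output above
wlist = [
    (72.0250015258789, 72.119873046875, 114.96617889404297, 106.299560546875, "Test", 0, 0, 0),
    (120.26901245117188, 72.119873046875, 231.72618103027344, 106.299560546875, "document", 0, 0, 1),
    (72.0250015258789, 106.21942138671875, 101.52294921875, 120.18524169921875, "Lorem", 1, 0, 0),
    (103.98699951171875, 106.21942138671875, 132.00445556640625, 120.18524169921875, "ipsum", 1, 0, 1),
]

lines = group_lines(wlist)
for key, (bbox, text) in lines.items():
    print(key, bbox, text)
```

The union bbox is just the min/max over the word boxes, which should be good enough for horizontal text; rotated text would need more care.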
A Note on the Name fitz
The standard Python import statement for the PyMuPDF library is import fitz. This has a historical reason:
The original rendering library for MuPDF was called Libart.
After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new modern graphics library called Fitz. Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.
Upvotes: 5
Reputation: 30579
You can parse the output of poppler's pdftotext with the -bbox option:
import subprocess

from lxml import etree

# pdftotext -bbox emits XHTML, so every element lives in this namespace
XHTML = "{http://www.w3.org/1999/xhtml}"

file = "PDF-export-example.pdf"
xml = etree.fromstring(subprocess.check_output(["pdftotext", "-bbox", file, "-"]))
for pn, page in enumerate(xml.findall(f".//{XHTML}page")):
    for word in page.findall(f"{XHTML}word"):
        print(pn, word.get("xMin"), word.get("yMin"),
              word.get("xMax"), word.get("yMax"), word.text)
0 72.025000 72.124000 114.977000 105.780000 Test
0 120.269000 72.124000 231.737000 105.780000 document
0 72.025000 106.220500 101.519500 119.755000 Lorem
0 103.987000 106.220500 132.001000 119.755000 ipsum
0 134.458000 106.220500 159.070000 119.755000 dolor
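If you want numbers rather than strings, the same XML can be turned into float bounding boxes. Here is a small sketch using only the standard library so it runs without poppler installed: SAMPLE is a hand-written, trimmed stand-in for the -bbox output (page attributes omitted, the two words taken from the output above), and words_with_boxes is a name I picked for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hand-written, trimmed sample in the shape of `pdftotext -bbox` output (assumption)
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><doc>
<page>
<word xMin="72.025000" yMin="72.124000" xMax="114.977000" yMax="105.780000">Test</word>
<word xMin="120.269000" yMin="72.124000" xMax="231.737000" yMax="105.780000">document</word>
</page>
</doc></body></html>"""


def words_with_boxes(xml_text):
    """Return a list of (page_no, (x_min, y_min, x_max, y_max), text) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for page_no, page in enumerate(root.iterfind(".//x:page", NS)):
        for word in page.iterfind("x:word", NS):
            box = tuple(float(word.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
            result.append((page_no, box, word.text))
    return result


words = words_with_boxes(SAMPLE)
for page_no, box, text in words:
    print(page_no, box, text)
```

With a real file, the string passed to words_with_boxes would be the output of the subprocess call shown earlier instead of SAMPLE.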
Upvotes: 4