Extract text from PDF files and preserve the original layout, in Python

Question

I want to extract text from PDF files while keeping the layout of the text as it appears in the PDF, as in the images below. The images show results from [github.com/JonathanLink/PDFLayoutTextStripper]. I tried the code below, but it does not maintain the layout. I want to get exactly the results shown in the images using any Python library, such as PyPDF2, pdfplumber, or pdfminer. I tried all of these libraries but did not get the desired results. How can I extract the text from a PDF file exactly as shown in the images?

from pdfminer.high_level import extract_text
text = extract_text('test.pdf')
print(text)

Amature · Accepted Answer

You can preserve the layout and indentation with the pdftotext package.

import pdftotext

with open("target_file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# All pages
for text in pdf:
    print(text)
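
If you want the result in a single string or file rather than printed page by page, something like the following might work. It is only a sketch: `join_pages` and `extract_layout_text` are hypothetical helper names, not part of pdftotext, and the form-feed page separator is an arbitrary choice. It relies on the same behaviour as the snippet above, namely that iterating over a `pdftotext.PDF` object yields one string per page.

```python
from typing import Iterable


def join_pages(pages: Iterable[str], separator: str = "\f") -> str:
    """Join per-page text into one string.

    The text of each page is left untouched, so the spacing that
    carries the layout survives. The form feed separator is an
    assumption; any string works.
    """
    return separator.join(pages)


def extract_layout_text(path: str) -> str:
    """Read a PDF with pdftotext and return all pages as one string."""
    # Third-party package, imported here so that join_pages can be
    # used (and tested) without pdftotext installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return join_pages(pdf)
```

Calling `extract_layout_text("target_file.pdf")` and writing the returned string to a text file keeps the column alignment, as long as the file is viewed in a monospaced font.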
