Reputation: 137
I want to extract text from PDF files while preserving the layout of the text, as in the images below. The images show output from [github.com/JonathanLink/PDFLayoutTextStripper].
I tried the code below, but it doesn't maintain the layout. I want to get exactly the results shown in the images using a Python library such as PyPDF2, pdfplumber, or pdfminer. I have tried all of these libraries and none of them gave the desired output. How can I extract the text from the PDF file exactly as it appears in the images?
from pdfminer.high_level import extract_text
text = extract_text('test.pdf')
print(text)
Upvotes: 2
Views: 8702
Reputation: 34
You can preserve the layout and indentation with the pdftotext package.
import pdftotext

# Load the PDF; the file must be opened in binary mode
with open("target_file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Print the text of every page
for text in pdf:
    print(text)
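For anyone curious how layout-preserving output is produced in general, here is a rough sketch of the idea: each word's page position is mapped to a row and column on a fixed-width character grid, and the gaps are padded with spaces. This is not how pdftotext or PDFLayoutTextStripper is actually implemented; `render_layout`, the `(x, y, text)` tuples, and the `char_width` / `line_height` defaults are all hypothetical names and values chosen for illustration, and `y` is assumed to be measured from the top of the page.

```python
from collections import defaultdict


def render_layout(words, char_width=6.0, line_height=12.0):
    """Place (x, y, text) words on a monospace grid.

    Hypothetical helper: x and y are page coordinates in points,
    with y assumed to grow downward from the top of the page.
    """
    # Bucket each word into a grid row, remembering its grid column
    rows = defaultdict(list)
    for x, y, text in words:
        rows[round(y / line_height)].append((round(x / char_width), text))

    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # Words would collide on the grid: keep a single space
                line += " "
            else:
                # Pad with spaces up to the word's column
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        (0, 0, "Name"), (120, 0, "Qty"),
        (0, 12, "Apple"), (120, 12, "3"),
    ]
    print(render_layout(sample))
```

In practice the word positions would come from whichever PDF library you use, and the grid cell size would need tuning to the document's font size.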
Upvotes: 1