user9625519

Reputation: 31

What's the best way to extract text from a PDF in Python without changing the layout and format?

I want to extract text from a PDF while preserving its exact format and layout.
If direct PDF-to-text conversion can't do this, is it possible to go PDF -> XML -> text?
I have already tried PyPDF2, pdfminer and pdftotext. I even tried AWS Textract, and it also produced an incorrect layout.
Basically, if I can reconstruct sentences from the text extracted from the PDF, that's enough.
I used the Zamzar API, which gives exactly the output I need, but it is quite expensive. Is there any other solution?

Upvotes: 0

Views: 453

Answers (1)

mphil4

Reputation: 105

If you want to keep the structure of the PDF but not the font, colour, size, etc., try the pdftables_api library, which should preserve the layout of your PDF. Convert the PDF to CSV, since a CSV file is just a comma-separated text file.
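As a rough sketch of the CSV route: the conversion call below follows the usage shown in the pdftables_api README as I understand it (`Client(api_key).csv(input, output)`), so treat that signature as an assumption and check it against the current docs. It needs network access and an API key. The helper names `convert_pdf_to_csv` and `csv_to_text_lines` are hypothetical, made up for this example. The second helper uses only the standard library to turn each CSV row back into a line of text.

```python
import csv
import io


def convert_pdf_to_csv(pdf_path, csv_path, api_key):
    """Convert a PDF to CSV through the PDFTables web service.

    Assumes the third-party pdftables_api package is installed and that
    its Client exposes csv(input_path, output_path); verify before use.
    """
    import pdftables_api  # imported here so the rest works without it

    client = pdftables_api.Client(api_key)
    client.csv(pdf_path, csv_path)


def csv_to_text_lines(csv_text):
    """Join the non-empty cells of each CSV row into one line of text."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:  # skip rows that contain only empty cells
            lines.append(" ".join(cells))
    return lines


if __name__ == "__main__":
    sample = 'Name,Amount\n"Smith, J.",10\n,,\n'
    for line in csv_to_text_lines(sample):
        print(line)
```

Joining cells with a single space discards column alignment; if the column positions matter, keep the rows as lists instead of flattening them.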

If you also need to keep the font, colour, etc., the Zamzar API is probably your best option.

Upvotes: 0
