Reputation: 1568
I am trying to print each word of a PDF on its own line, but so far I can only do this with text files, as shown below.
One constraint: I cannot convert the PDF to TXT first and then perform this operation. It must be done on the PDF file directly.
with open('filename.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)
If filename.txt contains just "Hello World!", then this code prints:
Hello
World!
I need to do the same with searchable PDF files as well. Any help would be appreciated.
Upvotes: 0
Views: 1614
Reputation: 18358
You can use pdfreader to extract text from a PDF document, both as plain strings and as text content that still contains the PDF operators.
Here is sample code that extracts both from every page of the document.
from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder for the path to your PDF
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
One thing to point out: text in PDFs usually does not come as "words". It is stored as commands that tell a conforming PDF viewer where and how to place each glyph, which means a single word may be drawn by several commands. Read more on that in PDF 1.7 docs sec.9 - Text
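As a rough illustration of that point, the sketch below glues the fragments from viewer.canvas.strings back together before splitting on whitespace. It reuses only the pdfreader calls from the sample code in this answer. The helper names words_from_fragments and print_pdf_words are made up for this example, and it assumes the PDF stores spaces inside its strings, which is not true of every file.

```python
def words_from_fragments(fragments):
    """Join text fragments exactly as given, then split on whitespace."""
    return "".join(fragments).split()


def print_pdf_words(pdf_path):
    """Print every word found in the PDF at pdf_path on its own line."""
    # Imported here so the helper above works without pdfreader installed.
    from pdfreader import SimplePDFViewer, PageDoesNotExist

    with open(pdf_path, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                for word in words_from_fragments(viewer.canvas.strings):
                    print(word)
                viewer.next()
        except PageDoesNotExist:
            # Raised once we step past the last page.
            pass
```

If words come out glued together, the file most likely positions words by coordinates instead of space characters, and a layout-aware extractor is probably the better tool.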
Upvotes: 0
Reputation: 131
Check out PyMuPDF. There is a lot you can do with it, including getting the text of a PDF line by line with page.getText() (spelled page.get_text() in newer releases).
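A minimal sketch of how that could look for the word-per-line case, assuming a reasonably recent PyMuPDF where the method is spelled get_text (older releases use getText). The "words" option returns one tuple per word, with the word itself at index 4. The function names here are invented for the example.

```python
def words_from_tuples(word_tuples):
    """Pick the word string out of each PyMuPDF "words" tuple.

    Each tuple has the form (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    return [entry[4] for entry in word_tuples]


def print_pdf_words(pdf_path):
    """Print every word of the PDF at pdf_path on its own line."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for word in words_from_tuples(page.get_text("words")):
                print(word)
    finally:
        doc.close()
```

This reads the PDF directly, so no intermediate TXT file is involved.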
Upvotes: 1
Reputation: 42
I got confused when I saw filename.txt.
Since you are working with PDFs, the link below might be helpful. See if it helps:
How to use PDFminer.six with python 3?
Upvotes: -1
Reputation: 1749
For PDFs, you should use pdfminer or PyPDF2.
Here is a good article on extracting the text; once you have the text, you can use Anilkumar's method to go through it line by line.
https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
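For what it is worth, here is a small sketch of that two-step idea using pdfminer.six, whose high-level extract_text function returns the document text as one string. The splitting step is simply the asker's own line.split() loop applied to that string. The names iter_words and print_pdf_words are placeholders chosen for this example.

```python
def iter_words(text):
    """Yield each whitespace-separated word, line by line."""
    for line in text.splitlines():
        for word in line.split():
            yield word


def print_pdf_words(pdf_path):
    """Print every word of the PDF at pdf_path on its own line."""
    # Lazy import: iter_words stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    for word in iter_words(extract_text(pdf_path)):
        print(word)
```

The text only ever lives in memory, so nothing is converted to a TXT file on disk.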
Upvotes: 1