Reputation: 69
I have used pdfplumber (https://github.com/jsvine/pdfplumber) to extract text from PDF files, following its GitHub page. I went through all of its properties, but I need to extract the title of the PDF when the metadata is not present.
Is there a way to do this with pdfplumber, or any other way to achieve it in Python?
import pdfplumber
pdf = pdfplumber.open(r'1.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(page.chars[0])
Upvotes: 3
Views: 2177
Reputation: 312
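I found the approach below. It filters the first page down to characters with a font size above 20, on the assumption that the title is set in a larger font than the body text, and then extracts the text from what remains: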
import pdfplumber
pdf = pdfplumber.open(r'1.pdf')
page = pdf.pages[0]
# Keep only objects whose font size is above 20; objects without a "size" key are dropped
filtered = page.filter(lambda x: x.get("size", 0) > 20)
print(filtered.extract_text())
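The fixed threshold of 20 will not suit every document, so here is a rough sketch of a variant that derives the threshold from the page itself. The helper names `largest_font_size` and `guess_title` and the `tolerance` parameter are my own, not part of pdfplumber; the sketch assumes the title is the largest text on the first page, which will not hold for every PDF. It also checks `pdf.metadata` first, since the question only needs the fallback when the metadata is missing.

```python
def largest_font_size(chars):
    """Return the largest "size" among char dicts, or None if none have a size."""
    sizes = [c["size"] for c in chars if c.get("size") is not None]
    return max(sizes) if sizes else None


def guess_title(path, tolerance=0.5):
    """Heuristic: metadata title if present, else the largest text on page 1.

    `tolerance` is a hypothetical knob that allows for small size differences
    between characters of the same heading.
    """
    import pdfplumber  # third-party; imported here so the helper above stays dependency-free

    with pdfplumber.open(path) as pdf:
        # Prefer the metadata title when the file provides a non-empty one.
        meta_title = (pdf.metadata or {}).get("Title")
        if isinstance(meta_title, str) and meta_title.strip():
            return meta_title.strip()

        page = pdf.pages[0]
        largest = largest_font_size(page.chars)
        if largest is None:  # e.g. a scanned page with no text layer
            return None

        # Same filtering idea as above, but with a threshold taken from the page.
        title_page = page.filter(
            lambda obj: obj.get("size", 0) >= largest - tolerance
        )
        return title_page.extract_text()
```

Calling `guess_title(r'1.pdf')` would then return the metadata title or the text set in the largest font. If a page has a large decorative element (a drop cap, a logo rendered as text), this heuristic will pick that up instead, so it is worth checking the result on a few sample files.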
Upvotes: 3