Reputation: 35
I'm using pdfplumber to extract text from a PDF. I can extract individual lines of text, but I'm having trouble extracting a whole paragraph. My current code is shown below the example.
Example of text I want to extract:
Paragraph Title
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque.
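    import pdfplumber

    with pdfplumber.open(path_to_pdf) as pdf:
        pageno = 1
        page = pdf.pages[pageno]
        text = page.extract_text(x_tolerance=5)
        # extract_text() returns a single string, so split it into lines first
        lines = text.split("\n")
        lines = [x.lower().strip() for x in lines]
        print(lines)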
How can I alter this to extract paragraphs instead? Right now it adds each line to the list as a separate element, which gives me this: ['Paragraph Title', 'lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et', 'dolore magna aliqua. enim facilisis gravida neque convallis a cras semper auctor neque.']
I want it to give me this instead, with the paragraph title as one element and the whole paragraph as the next: ['Paragraph Title', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque. ']
Upvotes: 3
Views: 1699
Reputation: 293
This workaround is not a real solution for identifying paragraphs, but it helps to detect their titles.
By calling page.extract_words(extra_attrs=["fontname", "size"]) you can analyze the text based on font name and size. Then use this information to identify the positions of the headers, and call page.crop(...).extract_text() to get the text between each pair of headers. The argument to crop() is the bounding box you construct from consecutive pairs of header positions.
You can find here more details.
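A minimal sketch of that idea follows. It assumes that titles are set in a larger font than the body text and that the most common font size on the page is the body size; neither is guaranteed for every PDF. The helper names (find_headers, section_boxes, paragraphs_on_page) and the size_margin and y_tolerance values are made up for illustration and are not part of pdfplumber.

```python
from collections import Counter


def find_headers(words, size_margin=0.5, y_tolerance=3):
    """Return [(title, top, bottom), ...] for text larger than the body font.

    `words` is the list of dicts returned by
    page.extract_words(extra_attrs=["fontname", "size"]),
    assumed to be in reading order.
    """
    if not words:
        return []
    # Assumption: the most frequent font size on the page is the body size.
    body_size = Counter(round(w["size"], 1) for w in words).most_common(1)[0][0]
    headers = []
    for w in words:
        if w["size"] <= body_size + size_margin:
            continue  # body text, not a header
        if headers and abs(w["top"] - headers[-1]["top"]) <= y_tolerance:
            # Same visual line as the previous header word: extend that header.
            headers[-1]["text"] += " " + w["text"]
            headers[-1]["bottom"] = max(headers[-1]["bottom"], w["bottom"])
        else:
            headers.append({"text": w["text"], "top": w["top"], "bottom": w["bottom"]})
    return [(h["text"], h["top"], h["bottom"]) for h in headers]


def section_boxes(headers, page_bbox):
    """Pair each header with the (x0, top, x1, bottom) box below it.

    Each box runs from the bottom of one header to the top of the next
    header, or to the bottom of the page for the last header.
    """
    x0, _, x1, page_bottom = page_bbox
    boxes = []
    for i, (title, _top, bottom) in enumerate(headers):
        next_top = headers[i + 1][1] if i + 1 < len(headers) else page_bottom
        if next_top > bottom:  # skip empty regions, which crop() would reject
            boxes.append((title, (x0, bottom, x1, next_top)))
    return boxes


def paragraphs_on_page(page):
    """Return [title, paragraph, title, paragraph, ...] for a pdfplumber page."""
    words = page.extract_words(extra_attrs=["fontname", "size"])
    result = []
    for title, bbox in section_boxes(find_headers(words), page.bbox):
        body = page.crop(bbox).extract_text() or ""
        # Collapse line breaks so the paragraph becomes a single string.
        result += [title, " ".join(body.split())]
    return result


# Usage (requires pdfplumber and a real file):
# import pdfplumber
# with pdfplumber.open(path_to_pdf) as pdf:
#     print(paragraphs_on_page(pdf.pages[1]))
```

Note that this treats everything between two headers as one paragraph. If a section contains several paragraphs, they would still need to be separated by some other signal, for example the vertical gap between lines.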
Upvotes: 0
Reputation: 1
As far as I can determine, PDF text extraction is just rubbish. You just get lines of text, with no paragraphs or columns. There may be great functions for tables, as there are for docx tables, but there is nothing for run-of-the-mill extraction of simple data in its original paragraph layout.
Upvotes: -1