Reputation: 21
I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, that follows the bold text, e.g. "Blue sky is what we see when we look up."
I can extract the "Blue sky" part, but I'm not able to extract the "is what we see when we look up" part.
import pdfplumber
with pdfplumber.open('C:/Users/somefile.pdf') as pdf:
    for i in range(12, 15):
        text = pdf.pages[i]
        clean_text = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        print(clean_text.extract_text())
Upvotes: 2
Views: 990
This answer provides code that follows the approach suggested in a comment:
Try to iterate over text.chars and print entry which appears after bold char and until some delimiter – Olvin Roght
With the function below, extracting the sentences that begin with a bold character (or are entirely bold) comes down to a single call on a page, shown after the definition. The function returns a Python list of strings containing every text section on the given page that begins with a bold character and ends with a period:
def get_boldSentences_from(PDFplumberPage):
    #, startsWith='PROP:"bold"', endsWith='CHAR:"."'):
    lstDct = PDFplumberPage.chars
    extracting = False
    lstSentences = []
    strSentence = None
    for dct in lstDct:
        if not extracting:
            if 'bold'.upper() in dct['fontname'].upper():
                extracting = True
                strSentence = dct['text']
            #:if 'bold'
        else: # extracting:
            char = dct['text']
            if char == ".":
                strSentence += char
                lstSentences.append(strSentence)
                strSentence = ''
                extracting = False
            else:
                strSentence += char
            #:if/else
        #:if/else
    #:for dct in lstDct
    return lstSentences
#:def
import pdfplumber
# str_pdf_file is a placeholder for the path to the PDF; it is not defined here
with pdfplumber.open(str_pdf_file, password='') as pdf:
    PDFplumberPage = pdf.pages[0]
    print( get_boldSentences_from(PDFplumberPage) )
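To check the logic without a PDF at hand, the loop can be exercised on hand-made character dictionaries. The following is only a sketch: `make_chars`, `bold_sentences`, and the font names are hypothetical, and it assumes that each entry of `page.chars` is a dict with at least the `"text"` and `"fontname"` keys, as used in the function above.

```python
from types import SimpleNamespace

def make_chars(segments):
    """Build a list of pdfplumber-like char dicts from (text, fontname) pairs."""
    return [
        {"text": ch, "fontname": fontname}
        for text, fontname in segments
        for ch in text
    ]

def bold_sentences(page):
    """Compact restatement of the loop above: start at a bold char, stop at a period."""
    sentences, current = [], None
    for char in page.chars:
        if current is None:
            # Not inside a sentence yet: wait for the first bold character.
            if "BOLD" in char["fontname"].upper():
                current = char["text"]
        else:
            # Inside a sentence: keep every character, bold or not.
            current += char["text"]
            if char["text"] == ".":
                sentences.append(current)
                current = None
    return sentences

# Hypothetical font names; real PDFs may use different or prefixed names.
fake_page = SimpleNamespace(chars=make_chars([
    ("Blue sky", "Arial-Bold"),
    (" is what we see when we look up.", "Arial"),
    (" This plain sentence is skipped.", "Arial"),
]))

result = bold_sentences(fake_page)
print(result)
```

Only the sentence that opens with bold characters is collected; text that follows the period is ignored until the next bold character appears.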
An interesting approach, though admittedly hard for a newcomer to follow, was posted as a comment to my answer:
You can apply certain optimizations to your code ;-) – Olvin Roght
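After some debugging and a few further optimizations, here is code that does much the same job as the function above. Note two differences: it also treats "!" and "?" as sentence delimiters, and the delimiter itself is not included in the returned text: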
def get_bold_sentences(page):
    from itertools import takewhile, dropwhile
    DELIMITERS = ".!?"
    FONTSTYLE = "BOLD"
    chars = iter(page.chars)
    while True:
        sentence = "".join(
            char["text"] for char in takewhile(
                lambda char: char["text"] not in DELIMITERS,
                dropwhile(
                    lambda char: FONTSTYLE not in char["fontname"].upper(),
                    chars)
            )
        )
        if sentence:
            yield sentence
        else:
            break
import pdfplumber
with pdfplumber.open(str_pdf_file, password="") as pdf:
    page = pdf.pages[0]
    print(*get_bold_sentences(page), sep="\n")
Note that the function above is a generator function, so to get all of the yielded values at once you need to call list() on it. The print statement shows the difference: the generator is unpacked with *, and passing sep="\n" to print() puts each sentence on its own line.
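The lazy behaviour of a generator is easier to see on synthetic data. The sketch below is not the code from the comment: it is a hypothetical plain-loop variant (`iter_bold_sentences`) that keeps the closing delimiter, which `takewhile` discards, and it assumes the same `"text"` and `"fontname"` keys on each char dict. The `make_page` helper and the font names are made up for illustration.

```python
from types import SimpleNamespace

DELIMITERS = ".!?"

def iter_bold_sentences(page):
    """Yield text from each bold character through the next delimiter, inclusive."""
    current = None
    for char in page.chars:
        if current is None:
            # Skip everything until a bold character starts a new sentence.
            if "BOLD" in char["fontname"].upper():
                current = char["text"]
        else:
            current += char["text"]
            if char["text"] in DELIMITERS:
                yield current
                current = None

def make_page(segments):
    """Stand-in for a pdfplumber page, built from (text, fontname) pairs."""
    chars = [
        {"text": ch, "fontname": fontname}
        for text, fontname in segments
        for ch in text
    ]
    return SimpleNamespace(chars=chars)

# Hypothetical font names, chosen only so that one of them contains "Bold".
page = make_page([
    ("Blue sky", "Arial-Bold"),
    (" is what we see when we look up.", "Arial"),
    (" Filler text. ", "Arial"),
    ("Red sun", "Arial-Bold"),
    (" rises!", "Arial"),
])

gen = iter_bold_sentences(page)
first = next(gen)     # values are produced one at a time, on demand
rest = list(gen)      # list() drains whatever is left
leftover = list(gen)  # an exhausted generator yields nothing more

print(first, rest, leftover, sep="\n")
```

Because a generator can be consumed only once, store `list(iter_bold_sentences(page))` in a variable if the sentences are needed more than once.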
Upvotes: 2