Ben

Reputation: 21

Is there a way to extract sentences after bold text in Python?

I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, that follows the bold text, e.g. "Blue sky is what we see when we look up."

I can extract the "Blue sky" part, but I'm not able to extract the "is what we see when we look up" part.

import pdfplumber 

with pdfplumber.open('C:/Users/somefile.pdf') as pdf: 
    for i in range(12, 15):
        text = pdf.pages[i]
        clean_text = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
        print(clean_text.extract_text())

Upvotes: 2

Views: 990

Answers (1)

user7711283

This answer provides code that implements the approach suggested in a comment:

Try to iterate over text.chars and print entry which appears after bold char and until some delimiter – Olvin Roght

With the code below, extracting sentences that begin with a bold character (or are entirely bold) comes down to calling:

get_boldSentences_from(PDFplumberPage)

This function returns a Python list of strings, one for each section of text on the page that begins with a bold character and ends with a period:

def get_boldSentences_from(PDFplumberPage):
    #, startsWith='PROP:"bold"', endsWith='CHAR:"."'):
    lstDct = PDFplumberPage.chars
    extracting = False
    lstSentences = []
    strSentence  = None
    for dct in lstDct:
        if not extracting:
            if 'bold'.upper() in dct['fontname'].upper():
                extracting = True
                strSentence = dct['text']
            #:if 'bold'
        else: # extracting:
            char = dct['text'] 
            if  char == ".":
                strSentence += char
                lstSentences.append(strSentence)
                strSentence = ''
                extracting = False
            else:
                strSentence += char
            #:if/else
        #:if/else
    #:for dct in lstDct        
    return lstSentences
#:def

import pdfplumber

# str_pdf_file is a placeholder for the path to your PDF file
with pdfplumber.open(str_pdf_file, password='') as pdf:
    PDFplumberPage = pdf.pages[0]
    print( get_boldSentences_from(PDFplumberPage) )
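
To try the logic without a PDF at hand, here is a minimal sketch of the same state machine running on hand-built character dicts. It assumes only that each entry has "text" and "fontname" keys, as the entries of page.chars do. The names bold_sentences_from_chars and make_chars are hypothetical helpers made up for this illustration, not part of pdfplumber. Unlike the function above, this variant also keeps a trailing sentence that never reaches a period.

```python
def bold_sentences_from_chars(chars, delimiters=(".",)):
    """Collect text runs that start at a bold character and end at a delimiter.

    `chars` is any iterable of dicts with "text" and "fontname" keys
    (the shape of the entries in pdfplumber's page.chars).
    """
    sentences = []
    current = None  # None means we are not inside a sentence
    for char in chars:
        if current is None:
            # Wait for the first bold character to start a sentence.
            if "bold" in char["fontname"].lower():
                current = char["text"]
        else:
            # Inside a sentence: take every character, bold or not.
            current += char["text"]
            if char["text"] in delimiters:
                sentences.append(current)
                current = None
    if current is not None:
        # The input ended before a delimiter was seen; keep the partial text.
        sentences.append(current)
    return sentences


def make_chars(text, fontname):
    """Hypothetical helper: build pdfplumber-like char dicts for a string."""
    return [{"text": c, "fontname": fontname} for c in text]


sample = (
    make_chars("Blue sky", "Helvetica-Bold")
    + make_chars(" is what we see when we look up. Plain text. ", "Helvetica")
    + make_chars("Note", "Helvetica-Bold")
    + make_chars(" with no period", "Helvetica")
)
result = bold_sentences_from_chars(sample)
print(result)
```

The non-bold sentence "Plain text." is skipped because no sentence is open when it is read, and nothing in it is bold.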

An interesting approach, though probably hard for a beginner to follow, was suggested in a comment on my answer:

You can apply certain optimizations to your code ;-) – Olvin Roght

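After a little debugging and some further optimization, here is code that does nearly the same job as the function above. It differs in two ways: it also treats "!" and "?" as sentence endings, and takewhile() discards the terminating character, so the returned sentences do not include it: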

def get_bold_sentences(page): 
    from itertools import takewhile, dropwhile
    DELIMITERS = ".!?"
    FONTSTYLE  = "BOLD"
    chars = iter(page.chars)  
    while True:
        sentence = "".join(
            char["text"] for char in takewhile(
                lambda char: char["text"] not in DELIMITERS,
                dropwhile(
                    lambda char: FONTSTYLE not in char["fontname"].upper(),
                    chars)
            )
        )
        if sentence:
            yield sentence
        else:
            break

import pdfplumber

with pdfplumber.open(str_pdf_file, password="") as pdf:
    page = pdf.pages[0]
    print(*get_bold_sentences(page), sep="\n")

Note that the function above is a generator function, so to get all of the yielded values at once you need to pass the result to list() or unpack it with *, as the print() call does. Passing sep="\n" to print() prints each sentence on its own line.
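
To illustrate how a generator like this is consumed, here is a small self-contained sketch. It is a variant, not the code above: iter_bold_sentences is a hypothetical name, it keeps the terminating character, and it runs on hand-built dicts with "text" and "fontname" keys instead of a real page.chars. The point to notice is that a generator can only be consumed once.

```python
from itertools import dropwhile


def iter_bold_sentences(chars, delimiters=frozenset(".!?"), fontstyle="BOLD"):
    """Yield text runs that start at a bold char and end at a delimiter (inclusive)."""
    chars = iter(chars)  # one shared iterator, so each pass resumes where the last stopped
    while True:
        # Skip ahead to the next bold character.
        rest = dropwhile(lambda c: fontstyle not in c["fontname"].upper(), chars)
        parts = []
        for char in rest:
            parts.append(char["text"])
            if char["text"] in delimiters:
                break  # sentence complete, delimiter kept
        if not parts:
            return  # no bold character left in the input
        yield "".join(parts)


def make_chars(text, fontname):
    """Hypothetical helper: build pdfplumber-like char dicts for a string."""
    return [{"text": c, "fontname": fontname} for c in text]


sample = (
    make_chars("Hi", "Arial-Bold")
    + make_chars(" there! skip. ", "Arial")
    + make_chars("Why", "Arial-Bold")
    + make_chars(" not?", "Arial")
)

gen = iter_bold_sentences(sample)
first_pass = list(gen)   # consumes the generator completely
second_pass = list(gen)  # the generator is now exhausted
print(first_pass)
print(second_pass)
```

If the sentences are needed more than once, store the result of list() in a variable rather than calling list() on the same generator object again.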

Upvotes: 2
