Pranav

Reputation: 11

Python pdfminer: LAParams not able to extract bullet points as paragraphs

I have a PDF file and I want to parse text from it with pdfminer. The problem is that LAParams does not extract each bullet point as a separate line, and I can't figure out why. My PDF looks like this: [image: pdf]. The output looks like this:

My code:

import os

from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams, LTTextContainer

# extract_hf, subsec_dict and sub_lst are defined elsewhere (not shown here).
PDF_DIR = 'C:/Users/2030117/Desktop/trial_domain_extraction'


def extract(filename):
    laparams = LAParams()
    header, footer = extract_hf(filename)
    path = os.path.join(PDF_DIR, filename)
    for key in subsec_dict:
        for i, page_layout in enumerate(extract_pages(path, laparams=laparams)):
            found = False
            # Skip the first page.
            if i == 0:
                continue
            for paras in page_layout:
                if not isinstance(paras, LTTextContainer):
                    continue
                text = paras.get_text()
                # Ignore page headers, page footers and empty text boxes.
                if text in header or text.strip() in footer:
                    continue
                if text.strip() == '':
                    continue
                # Only process boxes containing one of the strings listed for this key.
                if not any(ele in text for ele in subsec_dict.get(key)):
                    continue
                found = True
                lst = []
                for box in paras:
                    box_text = box.get_text()
                    if any(ele in box_text for ele in subsec_dict.get(key)):
                        lst.append(box_text.strip())
                    elif lst:
                        # Stop collecting at the first line that matches sub_lst.
                        if any(ele in box_text for ele in sub_lst):
                            break
                        lst.append(box_text.strip())
                print(' '.join(lst))
            # Stop scanning further pages once a match was found on this page.
            if found:
                break

I have also tried setting LAParams(detect_vertical=True, all_texts=True).
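
For illustration, here is a minimal sketch of one possible workaround, not a confirmed fix: instead of relying on LAParams to split bullets into separate boxes, read the individual layout lines and regroup them by bullet marker. The names `BULLET_RE`, `group_bullets` and `lines_from_pdf` are hypothetical, the set of bullet glyphs is an assumption about what the PDF contains, and `line_margin` is only one of the LAParams settings that might be worth adjusting.

```python
import re

# Assumed bullet markers (bullet glyphs, "-", "*", "1." or "1)"); adjust to the PDF.
BULLET_RE = re.compile(r'^\s*(?:[\u2022\u25cf\u25aa\u2023\-\*]|\d+[.)])\s+')


def group_bullets(lines):
    """Merge physical lines into one string per bullet item.

    A line that starts with a bullet marker opens a new item; any other
    non-empty line is treated as a continuation of the previous item.
    """
    items = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if BULLET_RE.match(line) or not items:
            items.append(text)
        else:
            items[-1] += ' ' + text
    return items


def lines_from_pdf(path, line_margin=0.5):
    """Yield the text of every layout line in the PDF at `path`."""
    # Imported here so group_bullets can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

    laparams = LAParams(line_margin=line_margin)
    for page_layout in extract_pages(path, laparams=laparams):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line.get_text()
```

Usage would be along the lines of `group_bullets(lines_from_pdf('file.pdf'))`. One limitation of this sketch: an ordinary paragraph that follows the last bullet is merged into that bullet, so an extra stop condition would be needed for such documents.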

Answers (0)