I have a PDF file and I want to parse text from it with pdfminer. The problem is that, with LAParams, the bullet points are not extracted as lines, and I can't figure out why. My PDF looks like this: (image: pdf). The output looks like this: (image). Here is my code:
    import os

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # extract_hf, subsec_dict and sub_lst are defined elsewhere (not shown).
    def extract(filename):
        laparams = LAParams()
        header, footer = extract_hf(filename)
        path = os.path.join('C:/Users/2030117/Desktop/trial_domain_extraction', filename)
        for key in subsec_dict:
            for i, page_layout in enumerate(extract_pages(path, laparams=laparams)):
                flag = 0
                if i == 0:
                    # Skip the first page.
                    continue
                for paras in page_layout:
                    if not isinstance(paras, LTTextContainer):
                        continue
                    text = paras.get_text()
                    # Ignore header/footer boxes and empty boxes.
                    if text in header or text.strip() in footer:
                        continue
                    if text.strip() == '':
                        continue
                    # Only look inside boxes that mention one of the subsection titles.
                    if any(ele in text for ele in subsec_dict.get(key)):
                        flag = 1
                        lst = []
                        for box in paras:
                            line = box.get_text()
                            if any(ele in line for ele in subsec_dict.get(key)):
                                lst.append(line.strip())
                            elif len(lst) != 0:
                                # Stop collecting at the next subsection title.
                                if any(ele in line for ele in sub_lst):
                                    break
                                lst.append(line.strip())
                        print(' '.join(lst))
                if flag:
                    break
        return
I have also tried setting LAParams(detect_vertical=True, all_texts=True).
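One possible workaround, independent of the layout parameters, is to post-process the extracted lines and stitch each bullet back together. The sketch below is only an illustration under two assumptions that are not confirmed by the PDF in question: that the bullet glyph is either extracted as a line of its own or sits at the start of a line, and that any non-bullet line following a bullet is a continuation of it. `group_bullets` and `BULLET_RE` are hypothetical helper names, and the input is assumed to be the `get_text()` strings of the lines inside a text box.

```python
import re

# Assumed bullet markers: common bullet glyphs, "-" or "*", or numbering such as "1." / "2)".
# Adjust this to whatever marker the PDF actually uses.
BULLET_RE = re.compile(r'^\s*([\u2022\u25cf\u25aa\u2023\-\*]|\(?\d+[.)])(\s+|$)')


def group_bullets(lines):
    """Merge raw text lines into one string per bullet item.

    A line starting with a bullet marker opens a new item. Any later
    non-empty line without a marker is appended to the current item.
    Lines seen before the first bullet are kept as separate items.
    """
    items = []
    in_bullet = False
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        match = BULLET_RE.match(text)
        if match:
            # Drop the marker; the item may start empty if the glyph
            # was extracted on a line of its own.
            items.append(text[match.end():].strip())
            in_bullet = True
        elif in_bullet:
            # Continuation of the current bullet item.
            items[-1] = (items[-1] + ' ' + text).strip()
        else:
            items.append(text)
    return items


if __name__ == '__main__':
    sample = ['Requirements:', '\u2022', 'Python 3',
              '\u2022 pdfminer.six installed', 'on the machine', '']
    for item in group_bullets(sample):
        print(item)
```

In the code above, this could be applied to `[box.get_text() for box in paras]` instead of joining every line with `' '.join(lst)`. Note the limitation: once a bullet has started, every following line without a marker is absorbed into it, so a stop condition (for example the `sub_lst` check) is still needed. It may also be worth experimenting with the `line_margin` and `char_margin` arguments of `LAParams`, since they influence how characters are grouped into lines and lines into boxes.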