user19560886

Reputation: 39

Python: extracting text from a PDF with pdfplumber/pdfminer duplicates bold characters

Goal: extract the text of a Chinese financial report

Implementation: use the Python pdfplumber/pdfminer packages to extract the PDF text to a txt file

Problem: text that is bold in the PDF appears duplicated in the extracted txt

An example:

The PDF text looks like this: [image: pdf text]. Python extracts it to txt as: [image: pdfplumber result]

I don't need the repeated text, just the normal text.

How should I do this? Should I switch to a different package or add a new function?

Please see the code and original pdf text below.

Additional: pdfplumber code:

import pdfplumber

def pdf2txt(filename, delLinebreaker=True):
    """Return the text of all pages in the PDF as one string."""
    pageContent = ''
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for a page without text
                # in some pdfplumber versions, so fall back to ''
                text = page.extract_text() or ''
                if delLinebreaker:
                    text = text.replace('\n', '')
                pageContent += text
    except Exception as e:
        print("file: ", filename, ', reason: ', repr(e))
    return pageContent

pdf2txt(r"report.pdf", delLinebreaker=False)

pdfminer code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
# the with-statement closes both files, even if a page fails to parse
with open(r"Report.pdf", 'rb') as fp, open(r"report.txt", 'w', encoding='utf-8') as outfp:
    device = TextConverter(rsrcmgr, outfp)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    device.close()

The pdfminer result is: [image: pdfminer result]

The PDF file can be downloaded from the official Shenzhen Stock Exchange website: http://www.szse.cn/disclosure/listed/bulletinDetail/index.html?9324ce3c-6072-499d-8798-b25d641b52ec

Upvotes: 1

Views: 1508

Answers (2)

vassiliev

Reputation: 643

With pdfplumber, this is a known bug. It was fixed on Oct 4, 2020, and the fix was included in release 0.5.24.

The problem is in pdfminer.six, the library that underlies both the pdfminer imports in the question and pdfplumber.

Update your pdfplumber, then use
page.dedupe_chars().extract_text()
instead of
page.extract_text()
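
To illustrate the idea behind de-duplicating characters, here is a minimal sketch in plain Python. It is not pdfplumber's actual implementation. It assumes the bold effect is produced by drawing each glyph several times at nearly the same position, and that each character is a dict with "text", "x0" and "top" keys (the keys pdfplumber uses for its char objects). The function name drop_overlapping_chars, the tolerance value, and the sample coordinates are made up for the example.

```python
def drop_overlapping_chars(chars, tolerance=1.0):
    """Keep a character only if no kept character has the same text
    at (almost) the same position."""
    kept = []
    for ch in chars:
        duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not duplicate:
            kept.append(ch)
    return kept


# Hypothetical input: "年" drawn three times with tiny horizontal offsets
# (pseudo-bold), followed by a single "报" further to the right.
chars = [
    {"text": "年", "x0": 100.0, "top": 50.0},
    {"text": "年", "x0": 100.3, "top": 50.0},
    {"text": "年", "x0": 100.6, "top": 50.0},
    {"text": "报", "x0": 112.0, "top": 50.0},
]
deduped_text = "".join(ch["text"] for ch in drop_overlapping_chars(chars))
print(deduped_text)
```

Comparing positions as well as text matters here: a string such as 年年度报告 contains the same character twice in a row legitimately, and those two glyphs sit a full character width apart, so they are both kept.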

Upvotes: 0

Jorj McKie

Reputation: 3120

With PyMuPDF, you can suppress pseudo-bold text, for example like this:

import fitz  # import PyMuPDF

doc = fitz.open("input.pdf")
page = doc[0]  # example first page

# extract text including its coordinates
blocks = page.get_text("dict", sort=True, flags=fitz.TEXTFLAGS_TEXT)["blocks"]
old_bbox = fitz.EMPTY_RECT()  # store previous bbox here
old_text = ""  # store previous text here
for b in blocks:  # loop over text blocks
    for l in b["lines"]:  # lines in current block
        bbox = fitz.Rect(l["bbox"])  # line boundary box
        # text in line - remove leading trailing spaces where possible
        text = " ".join([s["text"].strip() for s in l["spans"]]).strip()
        # check if new bbox overlaps old bbox
        isect = abs(bbox & old_bbox) / abs(bbox)  # overlap ratio
        if text != old_text or isect < 0.5:  # text unequal or no overlap
            print(text)  # print text
        old_text = text  # store for next
        old_bbox = +bbox  # store for next

The code above delivers this:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
2017 年年度报告
年年度报告
2018 年
年
年 04 月
月
1

instead of this:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
2017 年年度报告
年年度报告
年年度报告
年年度报告
2018 年
年
年
年 04 月
月
月
月
1

As you can see, some duplicates remain even with this corrective logic: 2017 年年度报告 is followed by 年年度报告, which probably duplicates the Chinese part of the previous line. To catch these cases too, the logic needs to be smarter and also check for partial bbox overlaps and trailing text equality, for example with if old_text.endswith(text) .... Doing this delivers a better result:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
2017 年年度报告
2018 年
年 04 月
1

But even then, the character 年 is duplicated between "2018" and "04". I think you get the point.
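
As a rough illustration of the smarter check (bbox overlap plus trailing-text equality), here is a minimal sketch in plain Python. It works on (text, bbox) tuples instead of PyMuPDF objects so that it runs without a PDF. The helper names, the 0.5 threshold, and all coordinates are assumptions made up for the example, not values taken from the actual report.

```python
def overlap_ratio(a, b):
    """Area of the intersection of boxes a and b, divided by the area of a.
    Boxes are (x0, y0, x1, y1) tuples."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (a[2] - a[0]) * (a[3] - a[1])
    return inter / area if area > 0 else 0.0


def drop_pseudo_bold(lines, min_overlap=0.5):
    """Drop a line if it overlaps the previous line and repeats
    that line's text or the tail of that line's text."""
    kept = []
    old_text, old_bbox = "", (0.0, 0.0, 0.0, 0.0)
    for text, bbox in lines:
        overlaps = overlap_ratio(bbox, old_bbox) >= min_overlap
        repeats = text == old_text or old_text.endswith(text)
        if not (overlaps and repeats):
            kept.append(text)
        old_text, old_bbox = text, bbox  # remember for the next line
    return kept


# Hypothetical lines with made-up coordinates, modelled on the output above.
lines = [
    ("浙江精功科技股份有限公司", (100, 100, 400, 130)),
    ("浙江精功科技股份有限公司", (100.5, 100, 400.5, 130)),
    ("2017 年年度报告", (150, 200, 350, 230)),
    ("年年度报告", (220.5, 200, 350.5, 230)),
    ("2018 年", (200, 300, 260, 320)),
    ("年", (245.5, 300, 260.5, 320)),
    ("年 04 月", (245, 300, 320, 320)),
    ("月", (305.5, 300, 320.5, 320)),
    ("1", (250, 700, 256, 712)),
]
result = drop_pseudo_bold(lines)
for line in result:
    print(line)
```

With this sample data the line 年 04 月 survives because it neither equals nor ends the preceding text, which is the same leftover 年 described above. Removing it would need a comparison at the character or span level rather than per line.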

Upvotes: 1
