Reputation: 39
Goal: extract the text of Chinese financial reports.
Implementation: Python, using the pdfplumber and pdfminer packages to extract PDF text to a txt file.
Problem: text that is bold in the PDF comes out duplicated in the extracted txt.
For example, given the following PDF text:
Python extracts it to txt as:
I don't want the repeated text, only the normal text.
How should I do this? Should I switch packages or add a new function?
Please see the code and the original PDF text below.
Additional: pdfplumber code:
import pdfplumber

def pdf2txt(filename, delLinebreaker=True):
    # Concatenate the extracted text of every page into one string.
    pageContent = ''
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages:
                # Guard against None for pages without extractable text.
                text = page.extract_text() or ''
                if delLinebreaker:
                    text = text.replace('\n', '')
                pageContent += text
    except Exception as e:
        print("file: ", filename, ', reason: ', repr(e))
    return pageContent

pdf2txt(r"report.pdf", delLinebreaker=False)
pdfminer code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
outfp = open(r"report.txt", 'w', encoding='utf-8')
device = TextConverter(rsrcmgr, outfp)
with open(r"Report.pdf", 'rb') as fp:
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
device.close()
outfp.close()
The PDF file can be downloaded from the official Shenzhen Stock Exchange website: http://www.szse.cn/disclosure/listed/bulletinDetail/index.html?9324ce3c-6072-499d-8798-b25d641b52ec
Upvotes: 1
Views: 1508
Reputation: 643
With pdfplumber, this is a known bug. It was fixed on Oct 4, 2020, and the fix was included in release 0.5.24.
The problem lies in pdfminer.six, which is a core module of both pdfminer and pdfplumber.
Update your pdfplumber, then use
page.dedupe_chars().extract_text()
instead of
page.extract_text()
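To make the idea behind character-level deduplication concrete, here is a minimal pure-Python sketch. It is not pdfplumber's actual implementation: the function name `dedupe_chars_sketch`, the `tolerance` parameter, and the sample coordinates are assumptions made for illustration, and the character dictionaries are only loosely modelled on the `text`, `x0` and `top` fields of pdfplumber's character objects. It drops a character when the same glyph has already been kept at almost the same position, which is how pseudo-bold text (the same glyph drawn several times with tiny offsets) typically shows up.

```python
def dedupe_chars_sketch(chars, tolerance=1.0):
    """Drop characters that repeat an already kept character at (almost) the same position.

    Illustrative only: `chars` is a list of dicts with hypothetical keys
    "text", "x0" and "top"; `tolerance` is in the same units as the coordinates.
    """
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Made-up data: a pseudo-bold "年" drawn three times with tiny offsets,
# followed by a normally drawn "报" further to the right.
sample_chars = [
    {"text": "年", "x0": 100.0, "top": 50.0},
    {"text": "年", "x0": 100.3, "top": 50.0},
    {"text": "年", "x0": 100.6, "top": 50.2},
    {"text": "报", "x0": 112.0, "top": 50.0},
]
deduped_text = "".join(c["text"] for c in dedupe_chars_sketch(sample_chars))
print(deduped_text)
```

Because the comparison uses position as well as the glyph itself, a legitimate repetition such as the two adjacent 年 in 年年度报告 is kept, as long as the two characters sit further apart than the tolerance. The quadratic scan is fine for a sketch but would be slow on large pages.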
Upvotes: 0
Reputation: 3120
With PyMuPDF you can suppress pseudo-bold text, for example like this:
import fitz  # import PyMuPDF

doc = fitz.open("input.pdf")
page = doc[0]  # example first page
# extract text including its coordinates
blocks = page.get_text("dict", sort=True, flags=fitz.TEXTFLAGS_TEXT)["blocks"]
old_bbox = fitz.EMPTY_RECT()  # store previous bbox here
old_text = ""  # store previous text here
for b in blocks:  # loop over text blocks
    for l in b["lines"]:  # lines in current block
        bbox = fitz.Rect(l["bbox"])  # line boundary box
        # text in line - remove leading / trailing spaces where possible
        text = " ".join([s["text"].strip() for s in l["spans"]]).strip()
        # check if new bbox overlaps old bbox
        isect = abs(bbox & old_bbox) / abs(bbox)  # overlap ratio
        if text != old_text or isect < 0.5:  # text unequal or no overlap
            print(text)  # print text
        old_text = text  # store for next
        old_bbox = +bbox  # store for next
The code above prints this:
浙江精功科技股份有限公司 2017 年年度报告全文
浙江精功科技股份有限公司
2017 年年度报告
年年度报告
2018 年
年
年 04 月
月
1
instead of this:
浙江精功科技股份有限公司 2017 年年度报告全文
浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
2017 年年度报告
年年度报告
年年度报告
年年度报告
2018 年
年
年
年 04 月
月
月
月
1
As you can see, some duplication remains even with this corrective logic:
2017 年年度报告 is followed by 年年度报告, which probably duplicates the Chinese part of the previous line. To catch these cases as well, the logic needs to be smarter still and also check for partial bbox overlaps and trailing text equality, for example with if old_text.endswith(text) ...
Doing this delivers a better result:
浙江精功科技股份有限公司 2017 年年度报告全文
浙江精功科技股份有限公司
2017 年年度报告
2018 年
年 04 月
1
Still, the character 年 is duplicated between "2018" and "04". I think you get the point.
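The trailing-text check mentioned above could be sketched roughly as follows. This is a pure-Python illustration rather than PyMuPDF code: bounding boxes are plain `(x0, y0, x1, y1)` tuples, and the helper names (`area`, `overlap_ratio`, `drop_pseudo_bold`), the `min_overlap` threshold and the sample coordinates are all assumptions made for the example. A line is dropped when it mostly overlaps the previous line and its text equals the tail of the previous text.

```python
def area(rect):
    """Area of an (x0, y0, x1, y1) rectangle; 0 for empty or inverted ones."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_ratio(rect, other):
    """Share of `rect` that is covered by `other` (0.0 if `rect` has no area)."""
    intersection = (
        max(rect[0], other[0]), max(rect[1], other[1]),
        min(rect[2], other[2]), min(rect[3], other[3]),
    )
    rect_area = area(rect)
    return area(intersection) / rect_area if rect_area else 0.0


def drop_pseudo_bold(lines, min_overlap=0.5):
    """Keep a line unless it overlaps the previous one and repeats its trailing text."""
    kept = []
    old_text, old_bbox = "", (0.0, 0.0, 0.0, 0.0)
    for text, bbox in lines:
        overlaps = overlap_ratio(bbox, old_bbox) >= min_overlap
        repeats_tail = bool(text) and old_text.endswith(text)
        if not (overlaps and repeats_tail):
            kept.append(text)
        old_text, old_bbox = text, bbox  # remember for the next line
    return kept


# Made-up coordinates imitating the report's pseudo-bold title lines.
sample_lines = [
    ("2017 年年度报告", (100.0, 200.0, 300.0, 230.0)),
    ("年年度报告", (160.3, 200.0, 300.3, 230.0)),
    ("年年度报告", (160.6, 200.0, 300.6, 230.0)),
    ("2018 年", (150.0, 300.0, 230.0, 330.0)),
    ("年", (200.3, 300.0, 230.3, 330.0)),
]
filtered = drop_pseudo_bold(sample_lines)
print(filtered)
```

Note that this sketch only handles repeats of the previous line's tail. It does not address the remaining case where a duplicate starts the next fragment, such as the 年 at the beginning of 年 04 月.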
Upvotes: 1