Reputation: 89
I want to extract the text from a PDF. But when I use PyPDF2 or PyMuPDF on this file, I have a problem: accented Vietnamese words come back as special characters. English words and unaccented words are extracted correctly.
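import fitz  # PyMuPDF

# PDF path
pdf_file = 'CB410A3 - Copy.pdf'
pdf = fitz.open(pdf_file)
# Pages are zero-indexed, so index 8 is the ninth page
a8 = pdf[8]
text = a8.get_text("text")  # named getText() in older PyMuPDF releases
text
(PyMuPDF code)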
Or
# Import the PyPDF2 module
import PyPDF2

# PDF path
pdf_file = r'D:data\VN\CB410A3.pdf'
# Open the PDF file in binary mode
PDFfile = open(pdf_file, 'rb')
PDFfilereader = PyPDF2.PdfFileReader(PDFfile)
# Page index is zero-based, so 8 is the ninth page
pages = PDFfilereader.getPage(8)
x = pages.extractText()
It returns something like: ' \nc«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3 (05.2022) \n \n \n9 \ngia cÇm; ®å ¨n s¸ng trªn c¬ së c¸; ®å ¨n s¸ng trªn c¬ së h¶i s¶n; ®å ¨n s¸ng trªn c¬ së thÞt; \n®å ¨n s¸ng'. But I want it to return the correctly accented Vietnamese text.
I tried decoding the result as UTF-8, but that didn't work. Can someone help me solve this problem? Thanks.
Starting from January 2023, the Industrial Property Official Gazette PDFs published by ipvietnam will no longer have encoding issues that may cause errors during parsing.
Upvotes: 1
Views: 2349
Reputation: 11867
The OP's link above does not work for me today, so here is a similarly built file from the same source. (Note that they use MS Word and ordinary Western fonts such as Arial, Calibri, Cambria and Times New Roman, so nothing exotic or UTF-8.)
File: vietnamese.pdf
Title: Microsoft Word - Heading_Nice10
Author: Administrator
Created: 2020-11-26 10:37:00
Modified: 2020-11-26 14:51:07
Application: Acrobat PDFMaker 11 for Word
PDF Producer: Adobe PDF Library 11.0
PDF Version: 1.6
PDF Optimizations: Tagged PDF
File Size: 431.4 KB (441,754 Bytes)
Number of Pages: 51
Page Size: 21.0 x 29.7 cm (A4)
Fonts: ArialMT (TrueType; Ansi)
Calibri (TrueType; Ansi; embedded)
Cambria (TrueType; Ansi; embedded)
TimesNewRomanPS-BoldItalicMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-BoldItalicMT (TrueType; Ansi)
TimesNewRomanPS-BoldMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-BoldMT (TrueType; Ansi)
TimesNewRomanPS-ItalicMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-ItalicMT (TrueType; Ansi)
TimesNewRomanPSMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPSMT (TrueType; Ansi)
CÔNG BÁO SỞ HỮU CÔNG NGHIỆP SỐ 392B (11/2021)
So normally there is no problem decoding governmental Vietnamese PDFs.
One of the simplest ways to extract text from a PDF is to use the pdftotext command-line tool:
pdftotext vietnamese.pdf -layout vietnamese.txt
Result
CÔNG BÁO SỞ HỮU CÔNG NGHIỆP SỐ 392B (11/2021)
PHỤ LỤC 2
BẢNG PHÂN LOẠI QUỐC TẾ
HÀNG HÓA/DỊCH VỤ NI-XƠ
Phiên bản 11-2021
BỘ KHOA HỌC VÀ CÔNG NGHỆ
CỤC SỞ HỮU TRÍ TUỆ
-----------------
One minor problem is that the word spacing may occasionally be odd and need a small adjustment or a slight change in the command-line options.
The OP's file is accessible again, and it is clearly faulty in the parts where the CID mapping is incorrect, as seen in the OP's question. It is not deliberate, simply poorly constructed. Correcting the mapping in such cases is hard work that needs many partial re-mappings.
C«ng b¸o së h÷u c«ng nghiÖp sè 411 tËp a - QUYÓN 1 (06.2022)
M· Sè HAI CH÷ C¸I THÓ HIÖN T£N N¦íC Vμ C¸C THùC THÓ KH¸C TRONG
C¸C T¦ LIÖU Së H÷U C¤NG NGHIÖP THEO TI£U CHUÈN ST3 CñA WIPO
AE United Arab Emirates CN China HK Hong Kong
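To give an idea of what one such partial re-mapping could look like, here is a minimal Python sketch. The table is not from any official source: I inferred it only from the lowercase words in the OP's sample (for example `c«ng b¸o së h÷u` should read `công báo sở hữu`), and the function name `remap_legacy_vietnamese` is made up for illustration. The output resembles a legacy single-byte Vietnamese font encoding (possibly TCVN3, but that identification is my assumption). Capitals such as `C¤NG` or `QUYÓN` appear to need a separate table, so they are not handled here.

```python
import unicodedata

# Partial table inferred from the sample text in the question.
# Keys are the code points the extractor returns; values are the
# Vietnamese letters that were intended. Lowercase text only.
PARTIAL_MAP = {
    0x00AB: "ô",  # « as in c«ng   -> công
    0x00B8: "á",  # ¸ as in b¸o    -> báo
    0x00EB: "ở",  # ë as in së     -> sở
    0x00F7: "ữ",  # ÷ as in h÷u    -> hữu
    0x00D6: "ệ",  # Ö as in nghiÖp -> nghiệp
    0x00E8: "ố",  # è as in sè     -> số
    0x00CB: "ậ",  # Ë as in tËp    -> tập
    0x00D3: "ể",  # Ó as in QuyÓn  -> Quyển
    0x00C7: "ầ",  # Ç as in cÇm    -> cầm
    0x00AE: "đ",  # ® as in ®å     -> đồ
    0x00E5: "ồ",  # å as in ®å     -> đồ
    0x00A8: "ă",  # ¨ as in ¨n     -> ăn
    0x00AA: "ê",  # ª as in trªn   -> trên
    0x00AC: "ơ",  # ¬ as in c¬     -> cơ
    0x00B6: "ả",  # ¶ as in h¶i    -> hải
    0x00DE: "ị",  # Þ as in thÞt   -> thịt
}


def remap_legacy_vietnamese(text: str) -> str:
    """Replace the mis-mapped characters listed in PARTIAL_MAP.

    Characters that are not in the table pass through unchanged, so
    plain ASCII text and digits are left alone.
    """
    # Normalize first so composed characters such as 'ë' are single code points.
    composed = unicodedata.normalize("NFC", text)
    return unicodedata.normalize("NFC", composed.translate(PARTIAL_MAP))


if __name__ == "__main__":
    sample = "c«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3"
    print(remap_legacy_vietnamese(sample))
```

Only apply something like this to pages you know are affected. The same code points are legitimate characters in correctly encoded text, which is one reason the repair has to be done piece by piece.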
Upvotes: 2
Reputation: 136665
pypdf (and also PyPDF2) has improved a lot, especially for text extraction. Try it again with a recent version; it should work now.
See https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
However, there are two cases where it will not work:
pypdf is not OCR software. Try tesseract in this case.
Upvotes: 1