Reputation: 89

How to fix encoding: Identity-H error when parsing text from a Vietnamese IP Official Gazette PDF with Python?

I want to extract the text of a PDF. But when I use PyPDF2 or PyMuPDF to extract text from this PDF, I have a problem: accented Vietnamese words come back as special characters. English words and unaccented words are not affected.

import fitz  # PyMuPDF

# PDF path
pdf_file = 'CB410A3 - Copy.pdf'
pdf = fitz.open(pdf_file)
# Read the page at index 8 (0-based)
a8 = pdf[8]
text = a8.get_text("text")  # getText() in older PyMuPDF releases
text
(PyMuPDF code)

# Import the PyPDF2 module
import PyPDF2

# PDF path
pdf_file = r'D:data\VN\CB410A3.pdf'

# Open the PDF file
PDFfile = open(pdf_file, 'rb')

# PdfFileReader, getPage and extractText are the names used by older PyPDF2 releases
PDFfilereader = PyPDF2.PdfReader(PDFfile)

# Read the page at index 8 (0-based)
pages = PDFfilereader.pages[8]
x = pages.extract_text()

It returns text like this: ' \nc«ng b¸o së h÷u c«ng nghiÖp sè 410 tËp a - QuyÓn 3 (05.2022) \n \n \n9 \ngia cÇm; ®å ¨n s¸ng trªn c¬ së c¸; ®å ¨n s¸ng trªn c¬ së h¶i s¶n; ®å ¨n s¸ng trªn c¬ së thÞt; \n®å ¨n s¸ng'. But I want it to return the properly accented Vietnamese text.

I tried decoding the results as UTF-8, but it didn't work. Can someone help me solve this problem? Thanks.

Update information:

Starting from January 2023, the Industrial Property Official Gazette PDFs published by ipvietnam will no longer have encoding issues that may cause errors during parsing.

Upvotes: 1

Answers (2)

K J

Reputation: 11867

The OP's link above does not work for me today, so here is a similarly built file from the same source. (Note that they use MS Word and normal Western fonts such as Arial, Calibri, Cambria and Times New Roman, thus nothing exotic or UTF-8.)

File: vietnamese.pdf
Title: Microsoft Word - Heading_Nice10
Author: Administrator
Created: 2020-11-26 10:37:00
Modified: 2020-11-26 14:51:07
Application: Acrobat PDFMaker 11 for Word
PDF Producer: Adobe PDF Library 11.0
PDF Version: 1.6
PDF Optimizations: Tagged PDF
File Size: 431.4 KB (441,754 Bytes)
Number of Pages: 51
Page Size: 21.0 x 29.7 cm (A4)
   
Fonts: ArialMT (TrueType; Ansi)
Calibri (TrueType; Ansi; embedded)
Cambria (TrueType; Ansi; embedded)
TimesNewRomanPS-BoldItalicMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-BoldItalicMT (TrueType; Ansi)
TimesNewRomanPS-BoldMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-BoldMT (TrueType; Ansi)
TimesNewRomanPS-ItalicMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPS-ItalicMT (TrueType; Ansi)
TimesNewRomanPSMT (TrueType (CID); Identity-H; embedded)
TimesNewRomanPSMT (TrueType; Ansi)

CÔNG BÁO SỞ HỮU CÔNG NGHIỆP SỐ 392B (11/2021)

So normally there is no problem decoding Vietnamese government PDFs.

One of the simplest ways to extract text from a PDF is to use the pdftotext command-line tool:

pdftotext vietnamese.pdf -layout vietnamese.txt

Result

CÔNG BÁO SỞ HỮU CÔNG NGHIỆP SỐ 392B (11/2021)


             PHỤ LỤC 2


BẢNG PHÂN LOẠI QUỐC TẾ
HÀNG HÓA/DỊCH VỤ NI-XƠ


            Phiên bản 11-2021
      BỘ KHOA HỌC VÀ CÔNG NGHỆ
          CỤC SỞ HỮU TRÍ TUỆ
              -----------------

One minor problem is that the word spacing may occasionally be odd and need a minor adjustment or a slight change in the command-line options.
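Since the question asks for Python, the same command can be driven from a script. This is only a minimal sketch: it assumes pdftotext is installed and on the PATH, the function names are made up for illustration, and the `-enc UTF-8` option is added here as an assumption to make the output encoding explicit.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, layout=True):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the physical page layout
    # Assumption: ask for UTF-8 output explicitly instead of relying on defaults.
    command += ["-enc", "UTF-8", str(pdf_path), str(txt_path)]
    return command


def pdf_to_text(pdf_path, txt_path, layout=True):
    """Run pdftotext and return the extracted text as a Python string."""
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, layout), check=True)
    return Path(txt_path).read_text(encoding="utf-8")
```

Calling pdf_to_text("vietnamese.pdf", "vietnamese.txt") would write the text file and return its contents; passing layout=False drops the -layout option if the spacing looks better without it.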

Later Edit

The OP file is accessible again, and it is clearly faulty in parts where the CID mapping is incorrect, as seen in the OP's question; it's not deliberate, simply poorly constructed. Correcting the mapping in such cases is hard work that needs many partial re-mappings.

C«ng b¸o së h÷u c«ng nghiÖp sè 411 tËp a - QUYÓN 1 (06.2022) 
M· Sè HAI CH÷ C¸I THÓ HIÖN T£N N¦íC Vμ C¸C THùC THÓ KH¸C TRONG 
C¸C T¦ LIÖU Së H÷U C¤NG NGHIÖP THEO TI£U CHUÈN ST3 CñA WIPO 

AE United Arab Emirates CN China HK Hong Kong
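One way such a partial re-mapping could look in Python is sketched below. It rests on an assumption that is not confirmed by the file itself: the substitutions in the sample (« for ô, ¸ for á, ® for đ, and so on) look consistent with the legacy TCVN3 (ABC) font encoding. The table is deliberately partial and lists only characters that occur in the output quoted in the question; the function name is made up for illustration.

```python
import unicodedata

# Partial TCVN3 (ABC) -> Unicode table.
# ASSUMPTION: the faulty pages use a TCVN3-encoded font. Only characters
# seen in the question's sample output are listed; a real converter would
# need the complete table.
TCVN3_TO_UNICODE = {
    "\u00ab": "ô",  # «
    "\u00b8": "á",  # ¸
    "\u00eb": "ở",  # ë
    "\u00f7": "ữ",  # ÷
    "\u00d6": "ệ",  # Ö
    "\u00e8": "ố",  # è
    "\u00cb": "ậ",  # Ë
    "\u00d3": "ể",  # Ó
    "\u00c7": "ầ",  # Ç
    "\u00ae": "đ",  # ®
    "\u00e5": "ồ",  # å
    "\u00a8": "ă",  # ¨
    "\u00aa": "ê",  # ª
    "\u00ac": "ơ",  # ¬
    "\u00b6": "ả",  # ¶
    "\u00de": "ị",  # Þ
}

_TABLE = str.maketrans(TCVN3_TO_UNICODE)


def tcvn3_to_unicode(text):
    """Replace the known stand-in characters with the intended letters."""
    # Normalize first so each stand-in is a single code point the table can match.
    composed = unicodedata.normalize("NFC", text)
    return unicodedata.normalize("NFC", composed.translate(_TABLE))
```

Applied to the start of the question's sample, tcvn3_to_unicode("c«ng b¸o së h÷u c«ng nghiÖp") gives "công báo sở hữu công nghiệp". The function should only be applied to text coming from the faulty fonts, because correctly encoded text that really contains characters such as è or ë would be damaged by it.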

Upvotes: 2

Martin Thoma

Reputation: 136665

pypdf (and also PyPDF2) has improved a lot, especially for text extraction. Try it again with a recent version; it should work now.

See https://pypdf.readthedocs.io/en/latest/user/extract-text.html

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())

However, there are two cases where it will not work:

Images: pypdf is not OCR software. Try tesseract in this case
Scrambled PDFs: Some people want to prevent software from reading their PDFs. This seems to be the case for your PDF. Your best shot is to convert the PDF to an image and use OCR software in such cases (again: tesseract)
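
The OCR route could be sketched roughly as follows. This is an illustration, not a tested recipe for this particular file: it assumes PyMuPDF, Pillow and pytesseract are installed, that the Tesseract binary and its Vietnamese language data ("vie") are available, and the function name and the 300 dpi default are arbitrary choices.

```python
import io


def ocr_pdf_page(pdf_path, page_index, dpi=300, lang="vie"):
    """Render one PDF page to an image and OCR it with Tesseract."""
    # Third-party imports are kept local so the module loads without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        # Rasterize the page; a higher dpi usually helps with small diacritics.
        pixmap = doc[page_index].get_pixmap(dpi=dpi)
        png_bytes = pixmap.tobytes("png")

    image = Image.open(io.BytesIO(png_bytes))
    return pytesseract.image_to_string(image, lang=lang)
```

For the page from the question, the call would be ocr_pdf_page(pdf_file, 8). OCR ignores the broken font mapping entirely, at the cost of speed and occasional recognition errors.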

Upvotes: 1
