Reputation: 59
In Adobe Acrobat Pro, the "Content View" displays the characters normally, but when I copy and paste them, the pasted text is invalid. With "Copy with Formatting", however, the pasted text is correct. (bad case image)
For example, take the first character, "重" (bad case PDF file). When I use PDFBox to extract the text, it logs warnings like this:
Jan 08, 2021 11:14:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+18429 (18429) in font GVAQVQ+SimSun
The font GVAQVQ+SimSun has no ToUnicode CMap entry, so PDFont.loadUnicodeCmap() finds nothing to load and PDType0Font.toUnicodeCMap is null. As a result, PDFont.toUnicode() returns null.
@mkl Is there some way to solve this problem? Thanks in advance.
PDType0Font/null, PostScript name: GVAQVQ+SimSun
0 = {SmallMap$SmallMapEntry@2123} "COSName{BaseFont}" -> "COSName{GVAQVQ+SimSun}"
1 = {SmallMap$SmallMapEntry@2124} "COSName{DescendantFonts}" -> "COSArray{[COSDictionary{COSName{BaseFont}:COSName{GVAQVQ+SimSun};COSName{CIDSystemInfo}:COSDictionary{COSName{Ordering}:COSString{Identity};COSName{Registry}:COSString{PDFXC30};COSName{Supplement}:COSInt{0};};COSName{DW}:COSInt{1000};COSName{FontDescriptor}:COSObject{COSDictionary{COSName{Ascent}:COSInt{859};COSName{AvgWidth}:COSInt{500};COSName{CapHeight}:COSInt{668};COSName{Descent}:COSInt{-141};COSName{Flags}:COSInt{32};COSName{FontBBox}:COSArray{COSInt{-8};COSInt{-145};1000;859;};COSName{FontFile2}:COSObject{COSDictionary{COSName{Length}:COSInt{175201};COSName{Filter}:COSArray{COSName{FlateDecode};};COSName{Length1}:COSInt{468544};}COSStream{-708342007}};COSName{FontName}:-120083354;COSName{ItalicAngle}:0;COSName{Leading}:COSInt{141};COSName{MaxWidth}:1000;COSName{MissingWidth}:500;COSName{StemH}:COSInt{70};COSName{StemV}:70;COSName{Type}:COSName{FontDescriptor};COSName{XHeight}:COSInt{438};}};COSName{Subtype}:COSName{CIDFontType2};COSName{Type}:COSNa
2 = {SmallMap$SmallMapEntry@2125} "COSName{Encoding}" -> "COSName{Identity-H}"
3 = {SmallMap$SmallMapEntry@2126} "COSName{Subtype}" -> "COSName{Type0}"
4 = {SmallMap$SmallMapEntry@2127} "COSName{Type}" -> "COSName{Font}"
"COSName{FontDescriptor}" -> "COSObject{15, 0}"
key = {COSName@2168} "COSName{FontDescriptor}"
value = {COSObject@2169} "COSObject{15, 0}"
baseObject = {COSDictionary@2209} "COSDictionary{COSName{Ascent}:COSInt{859};COSName{AvgWidth}:COSInt{500};COSName{CapHeight}:COSInt{668};COSName{Descent}:COSInt{-141};COSName{Flags}:COSInt{32};COSName{FontBBox}:COSArray{COSInt{-8};COSInt{-145};COSInt{1000};859;};COSName{FontFile2}:COSObject{COSDictionary{COSName{Length}:COSInt{175201};COSName{Filter}:COSArray{COSName{FlateDecode};};COSName{Length1}:COSInt{468544};}COSStream{-708342007}};COSName{FontName}:COSName{GVAQVQ+SimSun};COSName{ItalicAngle}:COSInt{0};COSName{Leading}:COSInt{141};COSName{MaxWidth}:1000;COSName{MissingWidth}:500;COSName{StemH}:COSInt{70};COSName{StemV}:70;COSName{Type}:COSName{FontDescriptor};COSName{XHeight}:COSInt{438};}"
needToBeUpdated = false
items = {SmallMap@2211} size = 16
0 = {SmallMap$SmallMapEntry@2214} "COSName{Ascent}" -> "COSInt{859}"
1 = {SmallMap$SmallMapEntry@2215} "COSName{AvgWidth}" -> "COSInt{500}"
2 = {SmallMap$SmallMapEntry@2216} "COSName{CapHeight}" -> "COSInt{668}"
3 = {SmallMap$SmallMapEntry@2217} "COSName{Descent}" -> "COSInt{-141}"
4 = {SmallMap$SmallMapEntry@2218} "COSName{Flags}" -> "COSInt{32}"
5 = {SmallMap$SmallMapEntry@2219} "COSName{FontBBox}" -> "COSArray{[COSInt{-8}, COSInt{-145}, COSInt{1000}, COSInt{859}]}"
6 = {SmallMap$SmallMapEntry@2220} "COSName{FontFile2}" -> "COSObject{12, 0}"
7 = {SmallMap$SmallMapEntry@2221} "COSName{FontName}" -> "COSName{GVAQVQ+SimSun}"
8 = {SmallMap$SmallMapEntry@2222} "COSName{ItalicAngle}" -> "COSInt{0}"
9 = {SmallMap$SmallMapEntry@2223} "COSName{Leading}" -> "COSInt{141}"
10 = {SmallMap$SmallMapEntry@2224} "COSName{MaxWidth}" -> "COSInt{1000}"
11 = {SmallMap$SmallMapEntry@2225} "COSName{MissingWidth}" -> "COSInt{500}"
12 = {SmallMap$SmallMapEntry@2226} "COSName{StemH}" -> "COSInt{70}"
13 = {SmallMap$SmallMapEntry@2227} "COSName{StemV}" -> "COSInt{70}"
14 = {SmallMap$SmallMapEntry@2228} "COSName{Type}" -> "COSName{FontDescriptor}"
15 = {SmallMap$SmallMapEntry@2229} "COSName{XHeight}" -> "COSInt{438}"
Upvotes: 1
Views: 2463
Reputation: 95918
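PDFBox text extraction works according to the algorithm presented in section 9.10.2 "Mapping Character Codes to Unicode Values" of the PDF specification ISO 32000-1. When trying to apply this algorithm to your file, it fails to extract the text drawn with the embedded SimSun font subset (F2): the font has no ToUnicode CMap, its encoding is Identity-H, and its character collection is PDFXC30-Identity rather than one of the Adobe collections the algorithm covers.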
Thus, text extraction as implemented in PDFBox cannot extract that Chinese text.
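To illustrate why the lookup comes up empty, here is a minimal sketch of the decision order in section 9.10.2 for composite fonts, written in plain Java. It is a simplified model and not PDFBox code: the class `UnicodeMappingSketch`, the method `toUnicode`, and the map-based stand-ins for CMaps are all hypothetical, and the sketch ignores simple fonts and the predefined-CMap branch of the algorithm.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class UnicodeMappingSketch {

    // Character collections that section 9.10.2 names for the registry-ordering-UCS2 lookup.
    static final Set<String> KNOWN_COLLECTIONS =
            Set.of("Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1");

    /**
     * Hypothetical, simplified model of the lookup order for a composite font.
     * toUnicodeCMap is null when the font dictionary has no ToUnicode entry.
     */
    static Optional<String> toUnicode(int cid,
                                      Map<Integer, String> toUnicodeCMap,
                                      String registry,
                                      String ordering,
                                      Map<String, Map<Integer, String>> ucs2CMaps) {
        // Step 1: a ToUnicode CMap in the font dictionary takes precedence.
        if (toUnicodeCMap != null && toUnicodeCMap.containsKey(cid)) {
            return Optional.of(toUnicodeCMap.get(cid));
        }
        // Step 2: for a known collection, look the CID up in "<registry>-<ordering>-UCS2".
        String collection = registry + "-" + ordering;
        if (KNOWN_COLLECTIONS.contains(collection)) {
            Map<Integer, String> ucs2 = ucs2CMaps.get(collection + "-UCS2");
            if (ucs2 != null && ucs2.containsKey(cid)) {
                return Optional.of(ucs2.get(cid));
            }
        }
        // Step 3: nothing left to try, so the code cannot be mapped to Unicode.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Values taken from the question: no ToUnicode CMap, collection PDFXC30-Identity.
        Optional<String> result = toUnicode(18429, null, "PDFXC30", "Identity", Map.of());
        System.out.println("CID 18429 -> " + result.orElse("no Unicode mapping"));
    }
}
```

With the values from the question, both steps are skipped and the method returns an empty result, which corresponds to the "No Unicode mapping" warning in the log.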
An alternative source for text information during text extraction presented in the PDF specification are ActualText entries for structure elements or marked-content sequences. But your PDF does not have any such ActualText entries either.
Thus, Adobe Acrobat copy&paste (which uses a combination of the algorithm mentioned before and ActualText analysis) cannot extract that Chinese text.
So "Copy with Formatting" in Adobe Acrobat Pro apparently must use some information beyond the mechanisms proposed by the PDF specification.
Inspecting the embedded font resource itself, one can see that it contains neither its own mappings to Unicode nor any standard names. It is notable, though, that the glyph numbers are not consecutive but have gaps. Probably these numbers have been retained from the full font during subsetting.
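If the glyph numbers really are those of the full font, a reverse lookup through the full font's Unicode-to-glyph table could in principle recover the text. The following is only a sketch of that idea in plain Java, under loud assumptions: the class `GlyphIdLookupSketch` and its methods are hypothetical, the table entries are invented placeholders rather than real SimSun data, and it assumes each character code in the PDF equals the glyph ID (Identity-H encoding with an identity CID-to-GID mapping, which the truncated dump does not confirm).

```java
import java.util.HashMap;
import java.util.Map;

final class GlyphIdLookupSketch {

    /** Inverts a Unicode -> glyph ID table (as a full font's cmap provides) into glyph ID -> Unicode. */
    static Map<Integer, Integer> invert(Map<Integer, Integer> unicodeToGid) {
        Map<Integer, Integer> gidToUnicode = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : unicodeToGid.entrySet()) {
            // Several code points may share one glyph; keep the smallest for a deterministic result.
            gidToUnicode.merge(e.getValue(), e.getKey(), Math::min);
        }
        return gidToUnicode;
    }

    /** Decodes glyph IDs to text, emitting U+FFFD for glyph IDs without a known code point. */
    static String decode(int[] gids, Map<Integer, Integer> gidToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int gid : gids) {
            Integer codePoint = gidToUnicode.get(gid);
            sb.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder table: these glyph IDs are made up for illustration only.
        Map<Integer, Integer> unicodeToGid = new HashMap<>();
        unicodeToGid.put(0x91CD, 101); // 重
        unicodeToGid.put(0x8981, 102); // 要
        Map<Integer, Integer> gidToUnicode = invert(unicodeToGid);
        System.out.println(decode(new int[] {101, 102}, gidToUnicode));
    }
}
```

This only works when the full font is available and the subset kept the original glyph numbering, so it is a workaround for specific files rather than a general extraction method.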
Adobe Acrobat Pro, therefore, appears to do one of the following during "Copy with Formatting" of your Chinese text:
Searching for the PDFXC30-Identity character collection, one finds that numerous text extraction tools have issues with it. For example, on the Aspose forums one can read:
Our team has looked into this issue and I would like to share with you that the software you used to create the sample PDF files used PDFXC30 character collection. This character collection is not standard and we don’t have any information about this encoding. This makes correct text extraction impossible at the moment.
(shahzadlatif's most recent response in the PdfExtractor encoding issue thread)
If you can provide PDFXC30 character collection mapping files from a trustworthy source, the PDFBox developers may include them in PDFBox to enable text extraction for files like yours.
Upvotes: 3