Reputation: 854
As the comments suggested, this task is difficult, so I want to tackle it step by step to find its limitations. First, I will focus on the 1st question below.
Origin:
I want to replace text in a PDF file for translation purposes, e.g. convert an English PDF into a Chinese PDF.
My solution is:
Specifically, I implement the IEventListener interface to get the render info, and use that render info to find each piece of text together with its position rectangle.
But I ran into some questions:
Is there a better way to achieve my goal than current solution?
Or can anyone offer some suggestions on the questions above?
UPDATED:
Example for the 1st question:
I simply record each text chunk encountered during rendering together with its position, and draw a rectangle around each text block. The code is:
Main in Main.java
PdfDocument pdfDoc = new PdfDocument(new PdfReader(srcFileName), new PdfWriter(destFileName));
SimplePositionalTextEventListener listener = new SimplePositionalTextEventListener();
// Collect every text chunk on the first page together with its rectangle.
new PdfCanvasProcessor(listener).processPageContent(pdfDoc.getFirstPage());
List<SimpleTextWithRectangle> result = listener.getResultantTextWithPosition();
// Draw on the same page that was parsed (the original used an undefined pageNumber).
PdfCanvas canvas = new PdfCanvas(pdfDoc.getFirstPage());
int R = 0, G = 0, B = 0;
for (SimpleTextWithRectangle textWithRectangle : result) {
    // Cycle the stroke color so neighbouring boxes are distinguishable.
    R = (R + 40) % 256;
    G = (G + 20) % 256;
    B = (B + 80) % 256;
    canvas.setStrokeColor(new DeviceRgb(R, G, B));
    canvas.rectangle(textWithRectangle.getRectangle());
    canvas.stroke();
}
pdfDoc.close();
SimplePositionalTextEventListener.java (implements IEventListener):
private List<SimpleTextWithRectangle> textWithRectangleList = new ArrayList<>();

private void renderText(TextRenderInfo renderInfo) {
    if (renderInfo.getText().trim().length() == 0)
        return;
    LineSegment ascent = renderInfo.getAscentLine();
    LineSegment descent = renderInfo.getDescentLine();
    float initX = descent.getStartPoint().get(0);
    float initY = descent.getStartPoint().get(1);
    float endX = ascent.getEndPoint().get(0);
    float endY = ascent.getEndPoint().get(1);
    Rectangle rectangle = new Rectangle(initX, initY, endX - initX, endY - initY);
    SimpleTextWithRectangle textWithRectangle = new SimpleTextWithRectangle(rectangle, renderInfo.getText());
    textWithRectangleList.add(textWithRectangle);
}

public List<SimpleTextWithRectangle> getResultantTextWithPosition() {
    return textWithRectangleList;
}

@Override
public void eventOccurred(IEventData data, EventType type) {
    renderText((TextRenderInfo) data);
}

@Override
public Set<EventType> getSupportedEvents() {
    return Collections.unmodifiableSet(new LinkedHashSet<>(Collections.singletonList(EventType.RENDER_TEXT)));
}
SimpleTextWithRectangle.java
private Rectangle rectangle;
private String text;

public SimpleTextWithRectangle(Rectangle rectangle, String text) {
    this.rectangle = rectangle;
    this.text = text;
}

public Rectangle getRectangle() {
    return rectangle;
}
The file is: PDF file
After processing, the header looks like this:
As we can see, there are some hidden texts that show up in the render info but are invisible in PDF reader applications. And if we dig into each text block, we can see that renderInfo.getText() sometimes does not exactly match the text we see in the PDF.
After processing, the footer looks like this:
As we can see, the rectangle boundary does not fully cover the text; that is what I mentioned in question 1.
Upvotes: 3
Views: 7139
Reputation: 95918
The incorrect box coordinates are an effect of a bug in the iText 7 CMap handling.
When parsing the named Encoding CMap of a Type 0 font, e.g. GBK-EUC-H, the else branch of this CMapEncoding constructor is used:
public CMapEncoding(String cmap, String uniMap) {
    this.cmap = cmap;
    this.uniMap = uniMap;
    if (cmap.equals(PdfEncodings.IDENTITY_H) || cmap.equals(PdfEncodings.IDENTITY_V)) {
        cid2Uni = FontCache.getCid2UniCmap(uniMap);
        isDirect = true;
        this.codeSpaceRanges = IDENTITY_H_V_CODESPACE_RANGES;
    } else {
        cid2Code = FontCache.getCid2Byte(cmap);
        code2Cid = cid2Code.getReversMap();
        this.codeSpaceRanges = cid2Code.getCodeSpaceRanges();
    }
}
Now FontCache.getCid2Byte(cmap) uses a CMapCidByte to build the mapping:
public static CMapCidByte getCid2Byte(String cmap) {
    CMapCidByte cidByte = new CMapCidByte();
    return parseCmap(cmap, cidByte);
}
One peculiarity of CMapCidByte (and probably other CMap classes) is that it stores the mapping inverted, i.e. keyed by CID instead of by character code:
private Map<Integer, byte[]> map = new HashMap<>();
[...]
void addChar(String mark, CMapObject code) {
    if (code.isNumber()) {
        byte[] ser = decodeStringToByte(mark);
        map.put((int)code.getValue(), ser);
    }
}
Maybe it's done this way because the most frequently used lookup direction is the other way around. This is fine as long as the original mapping is injective, i.e. all keys are mapped to different values.
Unfortunately CMaps do not need to be injective. E.g. for GBK-EUC-H we have cidrange entries
<21> <7e> 814
and
<aaa1> <aafe> 814
<aba1> <abc0> 908
When importing this encoding, therefore, the latter mappings overwrite many of the mappings of the character codes 0x21..0x7e.
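The overwriting can be reproduced without iText. The following is only a minimal sketch, not iText's actual code: the class InvertedCMapDemo and its methods are hypothetical, character codes are stored as plain integers instead of byte arrays, and only the three cidrange entries quoted above are loaded, in the order shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an inverted CMap store (CID -> character code).
class InvertedCMapDemo {

    // The three cidrange entries quoted above: {firstCode, lastCode, firstCid}.
    static final int[][] RANGES = {
        {0x21, 0x7e, 814},
        {0xaaa1, 0xaafe, 814},
        {0xaba1, 0xabc0, 908},
    };

    // Stores the mapping inverted, keyed by CID, like the map shown above.
    static Map<Integer, Integer> buildCidToCode() {
        Map<Integer, Integer> cidToCode = new HashMap<>();
        for (int[] r : RANGES) {
            for (int code = r[0]; code <= r[1]; code++) {
                // A later range that reuses a CID silently replaces the earlier code.
                cidToCode.put(r[2] + (code - r[0]), code);
            }
        }
        return cidToCode;
    }

    // Reverses the inverted map to get the code -> CID lookup direction back.
    static Map<Integer, Integer> reverse(Map<Integer, Integer> cidToCode) {
        Map<Integer, Integer> codeToCid = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : cidToCode.entrySet()) {
            codeToCid.put(e.getValue(), e.getKey());
        }
        return codeToCid;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cidToCode = buildCidToCode();
        Map<Integer, Integer> codeToCid = reverse(cidToCode);
        // 94 + 94 + 32 codes were added, but only 126 distinct CIDs survive.
        System.out.println(cidToCode.size());
        // The single-byte code 0x41 has lost its mapping ...
        System.out.println(codeToCid.containsKey(0x41));
        // ... while its two-byte counterpart 0xaac1 still maps to the shared CID.
        System.out.println(codeToCid.get(0xaac1));
    }
}
```

With only these three entries every code in 0x21..0x7e loses its mapping; how many are lost in the real CMap depends on its full list of entries and the order in which they are processed.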
In the document at hand there indeed is a font with encoding GBK-EUC-H used in the footer text. Thus, for this font many of the single-byte codes 0x21..0x7e are missing from iText's information about the font.
This range of codes encodes proportional Western characters in an otherwise monospaced font. The alternative codes 0xaaa1..0xaafe and 0xaba1..0xabc0, in contrast, encode the same Western characters as monospaced characters.
In the footer region of your example document these proportional Latin characters are used. Due to the missing mappings, in some iText 7 code paths these characters are replaced by the replacement character (e.g. text extraction itself does not return the Western characters but "�" instead), and in other paths they are lost completely (e.g. when the length of the text chunks is calculated, these Western characters are ignored).
Therefore, the length of the text chunks is calculated incorrectly and the bounding boxes, consequently, are mis-sized and misplaced.
This also explains why the misplaced bounding boxes on each line start at the first occurrence of Western characters on that line, and also why the largest part of the box is missing on lines with the most Western characters.
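The effect on the chunk length can be sketched in a few lines. This is an illustration under assumptions, not iText's width calculation: the class ChunkWidthDemo, the character codes, and the glyph widths (500 units for the proportional Latin glyphs, 1000 units for the full-width ones) are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a chunk width calculation that skips unmapped codes.
class ChunkWidthDemo {

    // Sums glyph widths (in 1/1000 text space units) and scales by the font size.
    // Codes without an entry are skipped, mimicking the lossy code path.
    static double chunkWidth(int[] codes, Map<Integer, Integer> widthByCode, double fontSize) {
        int units = 0;
        for (int code : codes) {
            Integer width = widthByCode.get(code);
            if (width != null) {
                units += width;
            }
        }
        return units * fontSize / 1000.0;
    }

    // Widths as they should be known: all values are assumptions for illustration.
    static Map<Integer, Integer> completeWidths() {
        Map<Integer, Integer> widths = new HashMap<>();
        widths.put(0x41, 500);    // proportional Latin glyph
        widths.put(0x42, 500);    // proportional Latin glyph
        widths.put(0xb0a1, 1000); // full-width glyph
        widths.put(0xb0a2, 1000); // full-width glyph
        return widths;
    }

    // The same table after the single-byte codes have lost their mappings.
    static Map<Integer, Integer> lossyWidths() {
        Map<Integer, Integer> widths = completeWidths();
        widths.remove(0x41);
        widths.remove(0x42);
        return widths;
    }

    public static void main(String[] args) {
        int[] chunk = {0xb0a1, 0x41, 0x42, 0xb0a2};
        System.out.println(chunkWidth(chunk, completeWidths(), 10)); // prints 30.0
        System.out.println(chunkWidth(chunk, lossyWidths(), 10));    // prints 20.0
    }
}
```

In this sketch the box for the chunk comes out a third too short, and any position derived from the accumulated width would shift left by the same amount, which is consistent with the boxes drifting from the first Western character onward.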
Upvotes: 2