Insert a NULL character with PDFBox

Question

Let us consider this code:

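import java.io.IOException;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class Test1 {

    public static void CreatePdf(String src) throws IOException, COSVisitorException {
        PDPageContentStream content;
        PDFont font;

        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);

        PDDocumentInformation info = document.getDocumentInformation();
        PDStream stream = new PDStream(document);
        info.setAuthor("PdfBox");
        info.setCreator("Pdf");
        info.setSubject("Stéganographie");
        info.setTitle("Stéganographie dans les documents PDF");
        info.setKeywords("Stéganographie, pdf");

        content = new PDPageContentStream(document, page, true, false);
        font= PDType1Font.HELVETICA;

        String hex = "4C0061f";  // shows "La"
        // Notice that we have 00 between 4C and 61, where 00 = null character.
        // The unpaired trailing "f" is never read by the loop below.

        // Decode each pair of hex digits into one character.
        StringBuilder sb = new StringBuilder();
        for (int count = 0; count < hex.length() - 1; count += 2) {
            String output = hex.substring(count, count + 2);
            int decimal = Integer.parseInt(output, 16);
            sb.append((char) decimal);
        }
        String tt = sb.toString();

        // Write the decoded string as a raw text-showing operation.
        content.beginText();
        content.setFont(font, 12);
        content.appendRawCommands("15 385 Td\n");
        content.appendRawCommands("(" + tt + ")" + "Tj\n");
        content.endText();
        content.close();

        document.save("doc.pdf");
        document.close();
    }
}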

My problem is: why is the "00" replaced by a space in the PDF document instead of being kept as a null character? Notice that I got the width 0.0 for this null character, but it shows as a space in the PDF document! Therefore I get "L a" instead of "La".

mkl · Accepted Answer

Why is the "00" replaced by a space in the PDF document instead of being kept as a null character?

If you look into your PDF you'll find that the font used for your text is defined as:

9 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj

So you are using a font with WinAnsiEncoding. If you look at the definition of that encoding in Annex D of the PDF specification, you will see that no code below 32 (decimal) is mapped to anything. You are therefore trying to use a character that is undefined in the encoding at hand, so the behavior is not defined; Acrobat Reader seems to use a positive width for those undefined code points.
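To check a hex string for such codes before writing it into a content stream, something like the following could be used. This is only a minimal sketch in plain Java: the class WinAnsiCheck and its methods are hypothetical helpers, not part of PDFBox, and the only rule it applies is the one stated above (codes below 32 have no mapping in WinAnsiEncoding).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class WinAnsiCheck {

    // Decodes pairs of hex digits into character codes, like the loop in the
    // question. An unpaired trailing digit (the "f" in "4C0061f") is ignored.
    static int[] decodeHex(String hex) {
        int[] codes = new int[hex.length() / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return codes;
    }

    // Collects the codes below 32, i.e. those that WinAnsiEncoding leaves unmapped.
    static List<Integer> unmappedCodes(int[] codes) {
        List<Integer> result = new ArrayList<>();
        for (int code : codes) {
            if (code < 32) {
                result.add(code);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The string from the question: 4C = 'L', 00 = null character, 61 = 'a'.
        System.out.println(unmappedCodes(decodeHex("4C0061f"))); // prints [0]
    }
}
```

Any code reported here is one for which a viewer is free to pick its own glyph and width.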

If you want to make sure your hidden characters don't cause any displacement at all, you should add an explicit array of widths to your font dictionary (cf. section 9.6.2 in the PDF specification) and make sure your invisible characters get a width of 0. (BTW, there you will also see that omitting the widths array, as PDFBox does here, was deprecated years ago.)

Notice that I got the width 0.0 for this null character

As soon as you are in undefined ranges, anything might happen and different programs have different assumptions.

PS Some code... Between your lines

font= PDType1Font.HELVETICA;

and

String hex = "4C0061f";  // shows "La"

I added the following code:

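// Imports needed (package names as in PDFBox 1.8.x):
// org.apache.fontbox.afm.AFMParser, org.apache.fontbox.afm.CharMetric,
// org.apache.fontbox.afm.FontMetric, org.apache.pdfbox.util.ResourceLoader,
// java.io.InputStream, java.util.ArrayList, java.util.List
InputStream afmStream = ResourceLoader.loadResource("org/apache/pdfbox/resources/afm/Helvetica.afm");
AFMParser afmParser = new AFMParser(afmStream);
afmParser.parse();
FontMetric afmMetrics = afmParser.getResult();
// Widths indexed by character code; codes without AFM metrics keep width 0.
List<Float> newWidths = new ArrayList<Float>();
for (CharMetric charMetric : afmMetrics.getCharMetrics())
{
    // Skip glyphs that have no character code (negative code in the AFM data).
    if (charMetric.getCharacterCode() < 0)
        continue;
    // Pad the list with zero widths up to this character code.
    while (charMetric.getCharacterCode() >= newWidths.size())
        newWidths.add(0f);
    newWidths.set(charMetric.getCharacterCode(), charMetric.getWx());
}
font.setFirstChar(0);
font.setLastChar(newWidths.size() - 1);
font.setWidths(newWidths);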

This code should read the Helvetica.afm font metrics resource included in PDFBox and create FirstChar, LastChar, and Widths entries from it. It works fine here, but if it doesn't in your installation, simply extract the afm file from the PDFBox jars and read it using a FileInputStream.
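To see what the loop above produces without needing PDFBox on the classpath, here is a rough sketch of the same padding logic on plain arrays. The class WidthTableSketch and the sample metrics are made up for illustration (three code/width pairs plus one entry with code -1); only the loop structure mirrors the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the AFM-based loop; not part of PDFBox.
final class WidthTableSketch {

    // Each entry is {character code, advance width}. Negative codes are skipped,
    // and every code that never appears keeps the padding width of 0.
    static List<Float> buildWidths(int[][] metrics) {
        List<Float> widths = new ArrayList<>();
        for (int[] metric : metrics) {
            int code = metric[0];
            if (code < 0) {
                continue;
            }
            while (code >= widths.size()) {
                widths.add(0f);
            }
            widths.set(code, (float) metric[1]);
        }
        return widths;
    }

    public static void main(String[] args) {
        // Illustrative sample data: space (32), 'L' (76), 'a' (97), one unencoded glyph.
        int[][] sample = { {32, 278}, {76, 556}, {97, 556}, {-1, 500} };
        List<Float> widths = buildWidths(sample);
        System.out.println(widths.size());  // prints 98 (codes 0 to 97)
        System.out.println(widths.get(0));  // prints 0.0
        System.out.println(widths.get(76)); // prints 556.0
    }
}
```

The resulting list starts at code 0, which is why FirstChar is set to 0 and LastChar to the list size minus one.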

For some reason the 00 character still seems to be given some width, but other characters below 32 (decimal) work fine, e.g.

String hex = "4C0461f";

shows "La" without a gap. If I interpret your former (now deleted) question concerning 1C and 1D correctly, this should already help you continue.

PPS: Concerning the question in the comments:

Can you tell me all the disadvantages of this method? And why does this method not work with accented characters, for example (Lé)? Your code works only with characters without accents; when we have an accent, we get L é instead of Lé. I only want to know what the disadvantages of your code are :)

I cannot list them all (because I'm really not that deep into font matters), but in essence the approach described above is somewhat incomplete.

As mentioned at the start, you use a font with WinAnsiEncoding in which no code below 32 (decimal) is mapped to anything. By adding FirstChar, LastChar, and Widths entries, we tried to define a zero width for those characters with codes below 32.

In spite of all that, though, we neither cared about encoding information for those codes (the encoding remained a pure WinAnsiEncoding) nor considered whether the font actually contains any information for those codes. Furthermore, making things still less controllable, we are talking about Helvetica, i.e. one of the standard 14 fonts, for which PDF viewers have to bring along their own information anyway. Wherever the explicitly given information and the information the viewer brings along contradict each other, PDF viewers might be biased towards their own information.

Why is there trouble especially with accented characters? I'm not sure. I would guess, though, that it is related to the fact that fonts usually don't bring along accented characters as separate entities but instead combine an accent and an unaccented character. Maybe internally the font the viewer uses has some information for such combined characters mapped at those code points below 32 and, therefore, the display becomes quirky when your explicit codes below 32 and the font's implicit use of such codes happen side by side.

In general, I would advise against doing things like this. For normal PDF documents it is not necessary at all.

In your case though, as you have titled your document Stéganographie dans les documents PDF, you obviously do want to somehow hide information in PDFs. Using invisible, unprintable characters seems one approach for that; thus, it is ok that you experiment in that direction. But PDF does offer many more ways to put any amount of information into a PDF without it being directly visible.

Depending on your specific aim, therefore, I would think that other approaches might hide the information more securely, e.g. private PieceInfo sections or custom tags in some other dictionaries...
