Jeyan

Reputation: 769

PDF Tamil writing using PDFBox

We've encountered an issue with the rendering of Tamil letters in PDF viewers: some letters are rendered differently than expected. The actual and expected renderings are shown below for reference:

Actual content rendering: [image]

Expected content: [image]

Upon analysis, we've identified three cases that require reordering or substitution of glyphs during generation:

Reversing the glyphs

        கெ = க + ெ =  க ெ  ->  ெ + க = கெ 

Splitting and reordering the glyphs

        கொ = க + ொ  = க ொ  ->    க + ெ + ா  ->  ெ + க + ா = கொ
                                    

Substituting new glyphs for certain combinations. The new glyphs have no Unicode code point; they exist only in the font file.

        கு = க + ு = க ு -> கு

| Input text | Char list from JDK | Code points from JDK | gid in ttf | Actual* | Expected |
|---|---|---|---|---|---|
| கெ | க + ெ | 2965 (க, U+0B95), 3014 (ெ, U+0BC6) | 1828, 1856 | க + ெ = க ெ | ெ + க = கெ (glyphs reversed) |
| கொ | க + ொ | 2965 (க, U+0B95), 3018 (ொ, U+0BCA) | 1828, 1859 | க + ொ = க ொ | க + ெ + ா -> ெ + க + ா = கொ (split and reordered) |
| கு | க + ு | 2965 (க, U+0B95), 3009 (ு, U+0BC1) | 1828, 1854 | க + ு = க ு | கு (gid = 6698); a new glyph that has no Unicode code point and exists only in the font file |
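To make the first two cases concrete, here is a minimal sketch in plain Java of a naive pre-shaping pass that splits two-part vowel signs and moves left-side vowel signs in front of the preceding consonant. The class and method names (`TamilVisualOrder`, `toVisualOrder`) are hypothetical and not part of PDFBox, and the pass assumes simple consonant + vowel-sign syllables only. It does not address the third case, since the substituted glyph has no code point and can only be found through the font's substitution data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox API): naive logical-to-visual reordering
// for simple Tamil consonant + vowel-sign syllables.
class TamilVisualOrder {

    // Tamil consonants occupy U+0B95 (KA) through U+0BB9 (HA).
    static boolean isConsonant(int cp) {
        return cp >= 0x0B95 && cp <= 0x0BB9;
    }

    // Vowel signs drawn to the left of the consonant: ெ, ே, ை.
    static boolean isLeftVowelSign(int cp) {
        return cp == 0x0BC6 || cp == 0x0BC7 || cp == 0x0BC8;
    }

    // Two-part vowel signs split into a left and a right part
    // (same pairs as their Unicode canonical decompositions).
    static int[] split(int cp) {
        switch (cp) {
            case 0x0BCA: return new int[] {0x0BC6, 0x0BBE}; // ொ -> ெ + ா
            case 0x0BCB: return new int[] {0x0BC7, 0x0BBE}; // ோ -> ே + ா
            case 0x0BCC: return new int[] {0x0BC6, 0x0BD7}; // ௌ -> ெ + ௗ
            default:     return new int[] {cp};
        }
    }

    static String toVisualOrder(String logical) {
        List<Integer> out = new ArrayList<>();
        logical.codePoints().forEach(cp -> {
            for (int part : split(cp)) {
                int last = out.isEmpty() ? -1 : out.get(out.size() - 1);
                if (isLeftVowelSign(part) && isConsonant(last)) {
                    // Place the left-side sign before the consonant it follows.
                    out.add(out.size() - 1, part);
                } else {
                    out.add(part);
                }
            }
        });
        StringBuilder sb = new StringBuilder();
        for (int cp : out) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the reordered code points for கெ and கொ.
        for (String s : new String[] {"\u0B95\u0BC6", "\u0B95\u0BCA"}) {
            StringBuilder line = new StringBuilder();
            toVisualOrder(s).codePoints()
                    .forEach(cp -> line.append(String.format("U+%04X ", cp)));
            System.out.println(line.toString().trim());
        }
    }
}
```

With this sketch, க + ெ comes out as ெ + க, and க + ொ comes out as ெ + க + ா, matching the Expected column. Anything beyond these simple syllables is outside what the sketch handles, and a real shaping engine would be needed there.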

I have been looking at GlyphSubstitutionTable, fontbox.cmap.Identity-H, and fontbox.unicode.Scripts.txt. Any help with handling this efficiently would be appreciated.

Links: Font, Actual, Expected, Use cases, PDFBox Jira

Upvotes: 1

Views: 623

Answers (1)

iPDFdev

Reputation: 5834

You need to implement a text shaping engine to handle Tamil writing.

Please see the OpenType specification: https://learn.microsoft.com/en-us/typography/opentype/spec/ . The GSUB and GPOS tables are the parts most relevant to you.

This is no easy task, so using an external library such as HarfBuzz may be a better choice.

There is also this PDFBox issue (4189) regarding Bengali writing. Maybe it will help you implement support for Tamil.

Update: for example, this HarfBuzz command line:

        hb-shape -O json -u U+0B95,U+0BC1 --no-glyph-names FreeSerif.otf

will return:

        [{"g":6698,"cl":0,"dx":0,"dy":0,"ax":858,"ay":0}]

You have to parse the JSON output, extract the glyph IDs (the "g" values), and provide them to PDFBox.
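As a rough sketch of the parsing step, the Java below runs `hb-shape` with the same flags as the command above and pulls the numeric "g" values out of its output. The class and method names (`HbShapeGlyphs`, `shape`, `parseGlyphIds`) are hypothetical, it assumes `hb-shape` is on the PATH, and it uses a regular expression instead of a real JSON parser to stay dependency-free, which only works because `--no-glyph-names` makes "g" a number. How the glyph IDs are then handed to PDFBox is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not a PDFBox or HarfBuzz API).
class HbShapeGlyphs {

    // Matches the numeric "g" field emitted when --no-glyph-names is used.
    private static final Pattern GID = Pattern.compile("\"g\"\\s*:\\s*(\\d+)");

    // Extracts glyph IDs, in output order, from hb-shape's JSON output.
    static List<Integer> parseGlyphIds(String json) {
        List<Integer> gids = new ArrayList<>();
        Matcher m = GID.matcher(json);
        while (m.find()) {
            gids.add(Integer.parseInt(m.group(1)));
        }
        return gids;
    }

    // Runs hb-shape on the text's code points; assumes hb-shape is on the PATH.
    static String shape(String fontPath, String text)
            throws IOException, InterruptedException {
        // Format the input as U+XXXX,U+XXXX for the -u option.
        String codePoints = text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(","));
        Process p = new ProcessBuilder(
                "hb-shape", "-O", "json", "-u", codePoints,
                "--no-glyph-names", fontPath)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String out = new String(p.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("hb-shape exited with an error");
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample output copied from the command shown above.
        String sample = "[{\"g\":6698,\"cl\":0,\"dx\":0,\"dy\":0,\"ax\":858,\"ay\":0}]";
        System.out.println(parseGlyphIds(sample));
    }
}
```

A proper JSON library would be the safer choice in production code; the regular expression is just the shortest way to show the idea.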

Upvotes: 1
