Reputation: 769
We've run into an issue with how Tamil letters are rendered in PDF viewers: some letters are rendered differently than expected. The actual and expected renderings are outlined below for reference.
Upon analysis, we've identified three cases that require reordering or substitution of glyphs during generation:
Reversing the glyphs
கெ = க + ெ = க ெ -> ெ + க = கெ
Splitting and reordering the glyphs
கொ = க + ொ = க ொ -> க + ெ + ா -> ெ + க + ா = கொ
Substituting new glyphs for certain combinations. The new glyphs do not have Unicode code points; they exist only in the font file.
கு = க + ு = க ு -> கு
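To make the first two cases concrete, here is a minimal Java sketch of the reordering at the character level. It is only an illustration of the visual order, not how a real shaping engine works (that happens on glyphs, driven by the font's tables). The class name `TamilVisualOrder` and its helper methods are hypothetical, and it assumes a simple consonant + vowel sign cluster; conjuncts and other scripts are not handled. It relies on the fact that Unicode canonical decomposition (NFD) splits ொ (U+0BCA) into ெ (U+0BC6) + ா (U+0BBE).

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class TamilVisualOrder {

    // Tamil vowel signs that are drawn to the left of the consonant: E, EE, AI.
    private static boolean isPreBaseVowelSign(int cp) {
        return cp == 0x0BC6 || cp == 0x0BC7 || cp == 0x0BC8;
    }

    // Tamil consonants sit in U+0B95..U+0BB9 (the range has unassigned gaps).
    private static boolean isConsonant(int cp) {
        return cp >= 0x0B95 && cp <= 0x0BB9;
    }

    // Returns the code points in the order they are drawn, left to right.
    static String toVisualOrder(String logical) {
        // Case 2: NFD splits two-part vowel signs, e.g. U+0BCA -> U+0BC6 U+0BBE.
        String nfd = Normalizer.normalize(logical, Normalizer.Form.NFD);
        int[] cps = nfd.codePoints().toArray();

        // Case 1: move a left-side vowel sign in front of its consonant.
        for (int i = 1; i < cps.length; i++) {
            if (isPreBaseVowelSign(cps[i]) && isConsonant(cps[i - 1])) {
                int tmp = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = tmp;
            }
        }
        return new String(cps, 0, cps.length);
    }
}
```

The third case (கு) cannot be expressed this way, because the target glyph has no code point; it needs a lookup in the font's substitution data.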
Input text | Char list from JDK | Code points from JDK | gid in ttf | Actual | Expected | Notes |
---|---|---|---|---|---|---|
கெ | க + ெ | க = 2965 (U+0B95), ெ = 3014 (U+0BC6) | 1828 1856 | க + ெ = க ெ | ெ + க = கெ | Reversing the glyphs is expected. |
கொ | க + ொ | க = 2965 (U+0B95), ொ = 3018 (U+0BCA) | 1828 1859 | க + ொ = க ொ | க + ெ + ா -> ெ + க + ா = கொ | Split and reorder is expected. |
கு | க + ு | க = 2965 (U+0B95), ு = 3009 (U+0BC1) | 1828 1854 | க + ு = க ு | கு (gid = 6698) | A new glyph is expected. The new glyph does not have a Unicode code point; it exists only in the font file. |
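For reference, the "Code points from JDK" column can be reproduced with something like the following. This is a small sketch using only the standard library; the class name `CodePointDump` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CodePointDump {

    // Lists each code point of the text as "decimal (U+HEX)".
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(String.format("%d (U+%04X)", cp, cp)));
        return out;
    }

    public static void main(String[] args) {
        // கு is stored as KA followed by VOWEL SIGN U.
        System.out.println(describe("\u0B95\u0BC1"));
    }
}
```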
I have been looking at GlyphSubstitutionTable, fontbox.cmap.Identity-H, and fontbox.unicode.Scripts.txt. Any help handling this in an efficient way would be appreciated.
Upvotes: 1
Views: 623
Reputation: 5834
You need to implement a text shaping engine to handle Tamil writing.
Please see the OpenType specification: https://learn.microsoft.com/en-us/typography/opentype/spec/ , the GSUB/GPOS tables are the main interest for you.
This is no easy task, so using an external library such as HarfBuzz may be a better choice.
There is also a PDFBox issue (4189) regarding Bengali writing. It may help you implement support for Tamil.
Update: for example, this HarfBuzz command line:
hb-shape -O json -u U+0B95,U+0BC1 --no-glyph-names FreeSerif.otf
will return:
[{"g":6698,"cl":0,"dx":0,"dy":0,"ax":858,"ay":0}]
You have to parse the JSON output, extract the glyph IDs, and provide them to PDFBox.
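A rough sketch of the parsing step in Java, assuming `hb-shape` is installed and on the PATH. The class `HbShapeOutput` and its method names are hypothetical. It pulls the `"g"` values out with a regular expression to stay dependency-free; a real JSON parser would be more robust. How the glyph IDs are then handed to PDFBox is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class HbShapeOutput {

    // With --no-glyph-names, every "g" value is a numeric glyph ID.
    private static final Pattern GLYPH_ID = Pattern.compile("\"g\"\\s*:\\s*(\\d+)");

    // Extracts the glyph IDs, in output order, from hb-shape's JSON.
    static List<Integer> glyphIds(String json) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = GLYPH_ID.matcher(json);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    // Runs the same command line as above; unicodes is e.g. "U+0B95,U+0BC1".
    static List<Integer> shape(String fontPath, String unicodes)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "hb-shape", "-O", "json", "-u", unicodes, "--no-glyph-names", fontPath)
                .redirectErrorStream(true)
                .start();
        // readAllBytes requires Java 9 or later.
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("hb-shape failed: " + out);
        }
        return glyphIds(out);
    }
}
```

For the output shown above, `glyphIds` yields a single entry, 6698, which matches the glyph ID expected for கு in the question.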
Upvotes: 1