Joe Urc

Reputation: 487

Reading and writing PDF files with ligatures?

I am attempting to read text from a PDF file, and then later on, write that same text back to another PDF using Python. After the text is read in, the representation of the string when I print it to the console is:

Officially, it’s called

However, when I print the repr() of this text string, I see:

O\xef\xac\x83cially, it\xe2\x80\x99s called

This makes plenty of sense to me: these are the UTF-8 byte sequences of ligatures and other typographic symbols from the PDF, e.g. \xef\xac\x83 is the 'ffi' ligature (U+FB03) and \xe2\x80\x99 is the curly apostrophe. The problem is that when I write this string to a PDF using the reportlab library, the output has black symbols in place of those characters.

[Image: the resulting PDF, with black symbols where the ligatures should be]

This only happens with certain ligatures. I am wondering what I can do so that the string I write to the PDF does not contain these ligatures or if there is an efficient way to replace all of them.
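
To check which characters the escaped bytes in the repr() actually are, one option is to decode them and look up the Unicode names. This is a minimal Python 3 sketch; the helper name describe_non_ascii is made up for illustration, and the byte string is the one quoted in the question.

```python
import unicodedata

# The bytes exactly as shown by repr() in the question (UTF-8 encoded text).
raw = b"O\xef\xac\x83cially, it\xe2\x80\x99s called"
text = raw.decode("utf-8")


def describe_non_ascii(s):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in s
        if ord(ch) > 127
    ]


for entry in describe_non_ascii(text):
    print(entry)
# ('ﬃ', 'U+FB03', 'LATIN SMALL LIGATURE FFI')
# ('’', 'U+2019', 'RIGHT SINGLE QUOTATION MARK')
```

So the three bytes \xef\xac\x83 are a single character, the 'ffi' ligature, which the output font has to supply a glyph for.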

Upvotes: 0

Views: 1536

Answers (2)

Joe Urc

Reputation: 487

What I ended up doing was copying the ligature characters out of my text file and calling .replace on each of them, i.e. text.replace('ﬀ', 'ff'). If the two arguments look the same to you, they are not: the argument on the left is the single ligature character and the one on the right is two plain f's. Also, don't forget the # -*- coding: utf-8 -*- line at the top of the source file, since the ligature appears literally in the code.
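
Rather than one .replace call per ligature, the same idea can be written as a single translation table. This is only a sketch in Python 3: the names LIGATURES and expand_ligatures are invented here, the table covers just the Latin ligatures at U+FB00 to U+FB06, and using \u escapes avoids pasting look-alike characters into the source.

```python
# Latin ligatures from the Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long-s + t ligature; mapped to plain "st" by choice
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each ligature character with its plain-letter spelling."""
    return text.translate(_TABLE)


print(expand_ligatures("O\ufb03cially, it\u2019s called"))
# Officially, it’s called
```

Characters not listed in the table, such as the curly apostrophe, pass through unchanged.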

Upvotes: 0

Jongware

Reputation: 22457

It appears your input is correct, but to see the character in your output you need a font that contains a glyph for it. The font you are using here is bog standard Arial, which does not contain it.

Some suggestions (mainly depending on your platform, but some of these are Open Source):

  • Arial Unicode MS
  • Lucida Grande
  • Calibri
  • Cambria
  • Corbel
  • Droid Sans/Droid Serif
  • Helvetica Neue
  • Ubuntu
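
Switching fonts in reportlab means registering a TrueType file and selecting it on the canvas. The sketch below assumes reportlab is installed and that you have a .ttf file for a font that includes the ligature glyphs; the function name write_with_font, the registered name "LigatureFont" and the file names in the usage comment are placeholders, not anything from the question.

```python
def write_with_font(text, pdf_path, font_path, font_name="LigatureFont"):
    """Draw `text` on a one-page PDF using the TrueType font at `font_path`."""
    # Imported here so this module still loads where reportlab is not installed.
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    # Make the TrueType font available to reportlab under `font_name`.
    pdfmetrics.registerFont(TTFont(font_name, font_path))

    c = canvas.Canvas(pdf_path)
    c.setFont(font_name, 12)
    c.drawString(72, 720, text)  # x, y in points from the bottom-left corner
    c.save()


# Hypothetical usage; both paths are assumptions:
# write_with_font("O\ufb03cially, it\u2019s called", "out.pdf", "DejaVuSans.ttf")
```

If the chosen font also lacks the glyph, the black symbols will still appear, so it is worth checking the output with one of the ligatures before converting a whole document.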

If you don't want, or are not able, to change the font, replace the sequence \xef\xac\x83 with the plain characters ffi in your program before writing the text to PDF. (And do the same for the other ligatures you mentioned.)
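
If you would rather not list every ligature by hand, Unicode compatibility decomposition can do the substitution. This is a hedged Python 3 sketch, not something the original poster used: the function name decompose_presentation_forms is made up, and normalization is deliberately restricted to U+FB00 through U+FB06, because applying NFKD to the whole string would also rewrite unrelated characters such as superscripts and non-breaking spaces.

```python
import unicodedata


def decompose_presentation_forms(text):
    """Expand Latin ligatures (U+FB00..U+FB06) via compatibility decomposition."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKD turns e.g. U+FB03 into the three letters "ffi".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


cleaned = decompose_presentation_forms("O\ufb03cially, it\u2019s called")
print(cleaned)
# Officially, it’s called
```

The curly apostrophe is left as it is, since it lies outside the range being normalized.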

Upvotes: 1
