Ruby_Beginner

Reputation: 99

Ruby: extract Arabic text from PDF

I usually use this code to extract text from PDFs:

require 'rubygems'
require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/myfile.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

This time I want to parse an Arabic PDF, but, using this code, I get a bunch of weird characters. For example: ±πNuô ≠ö ¥πbËÊ ´Lö Ë«_°u«» ±GKIW √±U±Nr ËîUÅW √Ê ´bœ Ë≠w «∞LπLuŸ, ¥L

I have already read that "coding: utf-8" is fine for Arabic, so is there a solution?

Upvotes: 2

Views: 1282

Answers (1)

Jongware

Reputation: 22457

The text in this PDF is not properly encoded: the relation between what appears on the screen and what character code it represents is not stored in this PDF. That's why you get 'random' text.

[Image: character definitions]

Also notable: the text appears in the correct order, but that is because the font characters are drawn mirrored and the text itself is also drawn mirrored:

[Image: characters drawn in mirrored shapes]

This is a typical hackish workaround for typesetting Arabic properly in QuarkXPress (there used to be an XTension that 'enabled' this).

Since this wrong encoding seems to be defined inside the fonts themselves ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and the characters they should be. Be aware that these tables may well differ for each font in this document, so you have to check which font each of your text strings uses.


Addition

I investigated further, and the results agree with your findings and with Acrobat Pro's. In the PDF's content stream, your sample text appears like this:

/F1 1 Tf        % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ

Usually, font entries in a PDF contain a table that 'translates' into actual character codes. That is also true for this font (and all others):

<<
  /Type     /Font
  /Subtype  /Type1
  /BaseFont     /HGKECF+PHBagdad
  /Encoding     66 0 R
  /ToUnicode    69 0 R
>>

(Only the relevant entries are listed.) The /Encoding entry points to a simple array that maps each index to a character code, and /ToUnicode points to a more formal table that contains essentially the same information. Both translate to the same text.
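
To make the idea of a /ToUnicode table more concrete, here is a minimal Ruby sketch that reads the bfchar entries of such a table into a Hash. The sample CMap text and the method name parse_bfchar are made up for illustration and are not taken from this PDF. A real table may also contain bfrange sections, which this sketch ignores.

```ruby
# Parse the "bfchar" entries of a ToUnicode CMap into a Hash that maps
# a character code (Integer) to the Unicode string it stands for.
# Only bfchar sections are handled; bfrange sections are ignored.
def parse_bfchar(cmap_text)
  mapping = {}
  cmap_text.scan(/beginbfchar(.*?)endbfchar/m).flatten.each do |section|
    section.scan(/<(\h+)>\s*<(\h+)>/) do |src, dst|
      # The destination is UTF-16BE, written as hexadecimal digits.
      unicode = [dst].pack("H*").force_encoding("UTF-16BE").encode("UTF-8")
      mapping[src.hex] = unicode
    end
  end
  mapping
end

# Made-up sample data in the bfchar layout, NOT copied from the PDF.
SAMPLE_CMAP = <<~CMAP
  2 beginbfchar
  <62> <0062>
  <E8> <00CB>
  endbfchar
CMAP

table = parse_bfchar(SAMPLE_CMAP)
table.each { |code, text| puts format("0x%02X -> %s", code, text) }
```

If a table like this maps the codes to Latin characters rather than Arabic ones, as it appears to do in this document, any extractor that trusts it will return the same 'random' text.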

As you can see in the top image, the font contains Arabic glyphs (mirrored), but the code linked to these glyphs is not the correct one for Arabic. It's like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma: text on your screen appears to be "αβγ" but in truth it says "abg".
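
The Symbol analogy can be sketched in a few lines of Ruby. The GLYPH_SHOWN table is an illustrative assumption that covers only the three letters mentioned above, not the full Symbol encoding.

```ruby
# The shapes a "Symbol"-style font draws for these Latin character codes.
# Illustrative only: three entries, not the complete font encoding.
GLYPH_SHOWN = { "a" => "α", "b" => "β", "g" => "γ" }.freeze

stored = "abg" # what the file contains, and what a text extractor returns
seen   = stored.each_char.map { |char| GLYPH_SHOWN.fetch(char, char) }.join

puts "stored: #{stored}"
puts "seen:   #{seen}"
```

The extractor only ever sees the stored codes, which is why the Arabic shapes on screen come out as Latin and accented characters.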


Addition 2

See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?

Quote:

Arabic XT fonts are not Arabic fonts from the operating system point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are placed in place of the Roman glyphs.

I tried to find a "correcting" encoding for your fonts but have so far not been successful. If I could locate a translation table, it ought to be possible to replace the existing /ToUnicode table with a corrected one, and you'd get the correct text when extracting. (Although it may be simpler to use the same table to change the text strings after extraction in your programming language of choice.)
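
A rough Ruby sketch of that post-extraction approach follows. The names TRANSLATION_TABLES and translate are hypothetical, and the two table entries are placeholders invented for illustration, since no verified mapping for these fonts is known. The sketch keeps one table per font, because the mapping may differ between fonts.

```ruby
# HYPOTHETICAL tables: the real mapping for these fonts is not known.
# Each font gets its own table, keyed by the font's BaseFont name.
TRANSLATION_TABLES = {
  "HGKECF+PHBagdad" => {
    "b" => "\u062F", # placeholder pair, not a verified mapping
    "u" => "\u0648"  # placeholder pair, not a verified mapping
  }
}.freeze

# Replace every character that has an entry in the font's table and
# leave all other characters (spaces, punctuation, unknowns) untouched.
# Text from a font without a table is returned unchanged.
def translate(text, font_name, tables = TRANSLATION_TABLES)
  table = tables[font_name]
  return text if table.nil?

  text.each_char.map { |char| table.fetch(char, char) }.join
end

puts translate("bu b", "HGKECF+PHBagdad")
```

Using it on real output would require knowing which font each extracted string uses, which plain page.text does not report.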

Upvotes: 4
