Reputation: 164
I need to read a PDF file in my C# program. The file is in Persian. I use the code below. It works fine when the font is Tahoma, for example, but it doesn't work when the font is a Persian one. How can I add Persian fonts to iTextSharp when reading a PDF?
An example of a Persian PDF: http://uplod.ir/idqrbqzzwl34/Visual_C__2005_Learning_(hashemian_).pdf.htm Persian text runs right to left, but when I extract it with iTextSharp it comes out left to right and is unreadable.
Upvotes: 1
Views: 723
Reputation: 77606
Your question is completely wrong and so is your comment to the other answer you received. You are assuming that extracted text has "a font". It hasn't. What you extract are bytes in a specific encoding (e.g. UTF-8).
Please watch this video: https://www.youtube.com/watch?v=wxGEEv7ibHE
Text content in a PDF is stored as a sequence of characters. These characters are mapped to glyphs. E.g. the character a can be mapped to a glyph such as "a", "a", "a" or any other glyph, including b or c. It's just "a code" that is used to find the instructions needed to draw the letter on the page.
What you need is another mapping. You need to find the mapping between the "character" that is used as a code in the content stream and the Unicode character it represents. There should be a ToUnicode mapping in your PDF, but... as you can see in the video I mention, not all PDFs have this mapping.
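To make the idea of this second mapping concrete, here is a minimal sketch in plain C#. It is not how iTextSharp is implemented: the class name ToUnicodeSketch, the one-byte codes and the table contents are all made up for illustration, and real PDFs may use multi-byte codes. It only shows that the same codes decode to readable text when a code-to-Unicode table is available and to nothing useful when it is not.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ToUnicodeSketch
{
    // Hypothetical table standing in for a ToUnicode mapping:
    // content-stream code -> Unicode character.
    public static readonly Dictionary<byte, string> ToUnicode =
        new Dictionary<byte, string>
        {
            { 0x01, "\u0633" }, // ARABIC LETTER SEEN
            { 0x02, "\u0644" }, // ARABIC LETTER LAM
            { 0x03, "\u0627" }, // ARABIC LETTER ALEF
            { 0x04, "\u0645" }, // ARABIC LETTER MEEM
        };

    // Translates codes to Unicode text. Codes without an entry become
    // U+FFFD (the replacement character), which is roughly what
    // "unreadable" extracted text looks like.
    public static string Decode(byte[] codes, IDictionary<byte, string> map)
    {
        var sb = new StringBuilder();
        foreach (byte code in codes)
        {
            string unicode;
            if (map.TryGetValue(code, out unicode))
            {
                sb.Append(unicode);
            }
            else
            {
                sb.Append('\uFFFD');
            }
        }
        return sb.ToString();
    }
}
```

Calling ToUnicodeSketch.Decode(new byte[] { 1, 2, 3, 4 }, ToUnicodeSketch.ToUnicode) yields the four letters of the Persian word "salam" in logical order; passing an empty dictionary for the same codes yields only replacement characters, even though the glyphs on the page would look identical in both cases.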
The best way to check if the text in your PDF can be extracted, is by copy/pasting text from Adobe Reader. If you succeed, you should be able to extract text programmatically; if you don't, you need to start looking for an OCR solution.
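If the copy/paste test succeeds, programmatic extraction could look roughly like the sketch below. It assumes iTextSharp 5.x; the class name PdfTextSketch and the method name ExtractAllText are made up for this example, and the choice of LocationTextExtractionStrategy is only one option.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Extracts the text of every page of the PDF at 'path'
    // (a hypothetical parameter; pass your own file location).
    public static string ExtractAllText(string path)
    {
        var sb = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            // Page numbers in iTextSharp start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string pageText = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                sb.AppendLine(pageText);
            }
        }
        finally
        {
            reader.Close();
        }
        return sb.ToString();
    }
}
```

The result is an ordinary .NET string of Unicode characters. How it is displayed (font, direction) is decided by whatever you show it in, not by the PDF.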
Update: I have downloaded your PDF and I've extracted the text. I don't see what is missing. Unfortunately I can't copy/paste the text here because the body of an answer is limited to 30000 characters.
Upvotes: 1