Reputation: 93
I want to extract the names of all the different fonts used for the text in a PDF file. I am using the iTextSharp DLL, and my code is given below.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace GetFontName
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf");
            HashSet<String> names = new HashSet<string>();
            PdfDictionary resources;
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary dic = reader.GetPageN(p);
                resources = dic.GetAsDict(PdfName.RESOURCES);
                if (resources != null)
                {
                    //gets fonts dictionary
                    PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                    if (fonts != null)
                    {
                        PdfDictionary font;
                        foreach (PdfName key in fonts.Keys)
                        {
                            font = fonts.GetAsDict(key);
                            string name = font.GetAsName(iTextSharp.text.pdf.PdfName.BASEFONT).ToString();
                            //check for prefix subsetted font
                            if (name.Length > 8 && name.ToCharArray()[7] == '+')
                            {
                                name = String.Format("%s subset (%s)", name.Substring(8), name.Substring(1, 7));
                            }
                            else
                            {
                                //get type of fully embedded fonts
                                name = name.Substring(1);
                                PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                                if (desc == null)
                                    name += "no font descriptor";
                                else if (desc.Get(PdfName.FONTFILE) != null)
                                    name += "(Type1) embedded";
                                else if (desc.Get(PdfName.FONTFILE2) != null)
                                    name += "(TrueType) embedded ";
                                else if (desc.Get(PdfName.FONTFILE3) != null)
                                    name += name;//("+font.GetASName(PdfName.SUBTYPE).ToString().SubSTring(1)+")embedded';
                            }
                            names.Add(name);
                        }
                    }
                }
            }
            var collections = from name in names
                              select name;
            foreach (string fname in collections)
            {
                Console.WriteLine(fname);
            }
            Console.Read();
        }
    }
}
The output I get is "Glyphless Font" no font descriptor" for every PDF file I use as input. The input file is available at the following link:
https://drive.google.com/open?id=0B6tD8gqVZtLiM3NYMmVVVllNcWc
Upvotes: 2
Views: 4561
Reputation: 96064
Running your code with minimal changes, I get this output:
%s subset (%s)
Actually, %s looks like a Java format specifier, not a .NET one. Using the .NET-style format string {0} subset ({1}) instead, I get
LiberationMono subset (BAAAAA+)
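To make the difference concrete, here is a minimal, self-contained sketch of the corrected formatting step. It does not use iTextSharp at all; `SubsetNameDemo` and `Describe` are hypothetical names introduced only for illustration, and the input strings assume the form that `ToString()` returns for a `/BaseFont` name, i.e. with the leading slash.

```csharp
using System;

static class SubsetNameDemo
{
    // Hypothetical helper: formats a raw /BaseFont name (leading '/' included).
    public static string Describe(string rawName)
    {
        // A subset font name is '/' + six-character tag + '+' + real name,
        // so the '+' sits at index 7.
        if (rawName.Length > 8 && rawName[7] == '+')
        {
            // .NET composite formatting uses indexed placeholders ({0}, {1}),
            // not Java-style %s.
            return String.Format("{0} subset ({1})",
                rawName.Substring(8),      // font name after the '+'
                rawName.Substring(1, 7));  // tag including the '+', as in the question
        }
        return rawName.Substring(1);       // strip the leading '/'
    }

    static void Main()
    {
        Console.WriteLine(Describe("/BAAAAA+LiberationMono")); // LiberationMono subset (BAAAAA+)
        Console.WriteLine(Describe("/Helvetica"));             // Helvetica
    }
}
```

The second argument keeps `Substring(1, 7)` from the question, which is why the `+` shows up inside the parentheses; `Substring(1, 6)` would give the bare six-character tag instead.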
I would suggest using backslashes with the verbatim string form @"..." instead of forward slashes in the file path, e.g. like this:
PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");
Also double-check the file name and path; after all, the file you provided is named Hello_World.pdf.
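One way to rule out a wrong path before involving iTextSharp is a plain existence check. The sketch below uses only `System.IO`; `PathCheckDemo` and `FileIsThere` are hypothetical names, and the path is the one from the question, so adjust it to wherever the file actually lives.

```csharp
using System;
using System.IO;

static class PathCheckDemo
{
    // Hypothetical helper: true if a file exists at the given path.
    public static bool FileIsThere(string path)
    {
        return File.Exists(path);
    }

    static void Main()
    {
        // Verbatim string: backslashes need no escaping.
        string path = @"C:\Users\agnihotri\Downloads\Test.pdf";

        if (!FileIsThere(path))
        {
            Console.WriteLine("File not found: " + path);
            return;
        }
        Console.WriteLine("Found " + path + " (" + new FileInfo(path).Length + " bytes)");
    }
}
```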
Upvotes: 2
Reputation: 77606
I opened your PDF in Adobe Acrobat and looked at the font panel. This is what I saw:
You have an embedded subset of LiberationMono, which means that the name of the font will be stored in the file as ABCDEF+LiberationMono (where ABCDEF is a series of six random but unique characters) because the font is subsetted. See "What are the extra characters in the font name of my PDF?"
Now let's take a look at the same file opened in iText RUPS:
We find the /Font object and it has a /FontDescriptor. In the /FontDescriptor, we find the /FontName in the format we expected: BAAAAA+LiberationMono.
Now that you know where to look for that name, you can adapt your code.
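As a rough sketch of such an adaptation, the code below reads `/FontName` from each font's `/FontDescriptor` and falls back to `/BaseFont` when no descriptor is present. It assumes iTextSharp 5.x; `FontNameDemo`, `CollectFontNames` and `AddName` are hypothetical names, and the step into `/DescendantFonts` is an addition not mentioned above, included because composite (Type0) fonts keep their descriptor on the descendant font rather than on the font dictionary itself.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class FontNameDemo
{
    // Hypothetical helper: collects font names from the page resources.
    public static HashSet<string> CollectFontNames(PdfReader reader)
    {
        var names = new HashSet<string>();
        for (int p = 1; p <= reader.NumberOfPages; p++)
        {
            PdfDictionary resources = reader.GetPageN(p).GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null) continue;

            foreach (PdfName key in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(key);
                if (font == null) continue;
                AddName(font, names);

                // Assumption: also look at descendant fonts of composite fonts.
                PdfArray descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
                if (descendants == null) continue;
                for (int i = 0; i < descendants.Size; i++)
                {
                    PdfDictionary descendant = descendants.GetAsDict(i);
                    if (descendant != null) AddName(descendant, names);
                }
            }
        }
        return names;
    }

    static void AddName(PdfDictionary font, HashSet<string> names)
    {
        PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        PdfName fontName = desc == null ? null : desc.GetAsName(PdfName.FONTNAME);
        // No descriptor or no /FontName: fall back to /BaseFont.
        if (fontName == null) fontName = font.GetAsName(PdfName.BASEFONT);
        // DecodeName drops the leading '/' and resolves #xx escapes.
        if (fontName != null) names.Add(PdfName.DecodeName(fontName.ToString()));
    }

    static void Main()
    {
        // Path taken from the question; adjust as needed.
        PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");
        foreach (string name in CollectFontNames(reader))
            Console.WriteLine(name);
        reader.Close();
    }
}
```

This only inspects fonts listed directly in each page's resources; fonts referenced from nested resources such as form XObjects would need an additional recursive walk.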
Upvotes: 3