der_chirurg
der_chirurg

Reputation: 1525

C# Extract text from PDF using PdfSharp

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.

Upvotes: 68

Views: 72315

Answers (4)

Kaelidian
Kaelidian

Reputation: 1

Using this method I actually recently figured out how to do it for what you guys are calling unicode. But it's not exactly unicode, its PdfEncoding. Embedded Fonts causes the pdf to make differences tables called CMaps that you have to store and swap out the pdfEncoding unicode values, until you find one in the cmap table and put it there instead. I turned symbols into readable text and it took 3 weeks of learning about pdf file structures. You'll also need sharpZipLib to inflate the cmap tables as they are compressed.

Upvotes: -2

Sergio
Sergio

Reputation: 501

I have implemented it somehow similar to how David did it. Here is my code:

...
{
    // ....
    var page = document.Pages[1];
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        textList.Add(cString.Value);
    }
    return textList;
}

Upvotes: 23

Ronnie Overby
Ronnie Overby

Reputation: 46460

Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.

public static class PdfSharpExtensions
{
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {       
        var content = ContentReader.ReadContent(page);      
        var text = content.ExtractText();
        return text;
    }   

    public static IEnumerable<string> ExtractText(this CObject cObject)
    {   
        if (cObject is COperator)
        {
            var cOperator = cObject as COperator;
            if (cOperator.OpCode.Name== OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;   
            }
        }
        else if (cObject is CSequence)
        {
            var cSequence = cObject as CSequence;
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            var cString = cObject as CString;
            yield return cString.Value;
        }
    }
}

Upvotes: 58

David Schmitt
David Schmitt

Reputation: 59316

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.

I've uploaded a simple implementation to github.

Upvotes: 14

Related Questions