Reputation: 81
I'm using PDFsharp and some modified methods I found online to extract text from PDFs. However, depending on how the PDF was created, the methods sometimes return strings like '\0\u0019' or '\0\u0013' instead of the actual text, and the console window renders those as various shapes and special characters. I'm assuming this is because of how the PDF was originally created and might be related to the text encoding.
I've tried a number of encoding conversions that I found online, without any success. I'm not really familiar with Unicode, ASCII, etc. Any suggestions on how I might be able to return the text correctly? Below are the methods I'm using to extract the text from the PDF.
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
namespace Job_Ingestor
{
    public static class PdfSharpExtensions
    {
        // Concatenates every text string found in the content stream of one page.
        public static string ExtractTextByRow(PdfDocument doc, int pageIndex = 0)
        {
            string rtnTxt = string.Empty;
            PdfPage page = doc.Pages[pageIndex];
            CObject content = ContentReader.ReadContent(page);
            var extractedText = PdfSharpExtensions.ExtractText(content);
            foreach (var t in extractedText)
            {
                rtnTxt = rtnTxt + t;
            }
            return rtnTxt;
        }

        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            var text = content.ExtractText();
            return text;
        }

        // Walks the content object tree and yields the raw operand strings of Tj/TJ operators.
        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator cOperator)
            {
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                        foreach (string txt in ExtractText(cOperand))
                        {
                            yield return txt;
                        }
                }
            }
            else if (cObject is CSequence cSequence)
            {
                foreach (var element in cSequence)
                    foreach (var txt in ExtractText(element))
                    {
                        yield return txt;
                    }
            }
            else if (cObject is CString cString)
            {
                // Raw string bytes as stored in the content stream; not necessarily Unicode text.
                yield return cString.Value;
            }
        }
    }
}
Upvotes: -1
Views: 335
Reputation: 21689
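The Tj commands in PDF files sometimes work with glyph IDs rather than character codes, which is why you get values like '\0\u0019' instead of readable text. Accessible PDF files have a table (the font's ToUnicode CMap) that maps glyph IDs to Unicode characters.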
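To illustrate the idea, here is a minimal sketch of the lookup step. It assumes the font uses 2-byte codes (as is common with Identity-H encoded fonts) and that you have already obtained the glyph-ID-to-Unicode table from the font's ToUnicode CMap; reading that CMap is not shown. The GlyphCodeDecoder class, its Decode method, and the sample table are hypothetical names and values for illustration only, not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class GlyphCodeDecoder
{
    // Reads the raw string two chars at a time, combines each pair into a
    // big-endian 16-bit code, and looks that code up in the supplied table.
    // Codes missing from the table become U+FFFD; a trailing unpaired char is ignored.
    public static string Decode(string raw, IReadOnlyDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            sb.Append(toUnicode.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Hypothetical table; real values come from the font's ToUnicode CMap
        // and differ from font to font.
        var table = new Dictionary<int, string>
        {
            { 0x0013, "H" },
            { 0x0019, "i" },
        };
        Console.WriteLine(GlyphCodeDecoder.Decode("\0\u0013\0\u0019", table)); // prints "Hi"
    }
}
```

In your extractor, something like this would be applied to each cString.Value, using the table that belongs to whichever font is active at that point in the content stream. If the PDF has no ToUnicode table for a font, the text generally cannot be recovered this way.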
Upvotes: 0