PdfTextExtractor.GetTextFromPage is not returning correct text

Question

Using iTextSharp, I have the following code, which successfully pulls out the text for the majority of the PDFs I'm trying to read:

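// Needs: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string text = "";
PdfReader reader = new PdfReader(fileName);
// Page numbers are 1-based
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i);
}
reader.Close();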

However, some of my PDFs contain XFA forms (which have already been filled out), and for those the 'text' variable ends up filled with the following garbage:

"Please wait... 
  
If this message is not eventually replaced by the proper contents of the document, your PDF 
viewer may not be able to display this type of document. 
  
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by 
visiting  http://www.adobe.com/products/acrobat/readstep2.html. 
  
For more assistance with Adobe Reader visit  http://www.adobe.com/support/products/
acrreader.html. 
  
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other 
countries."

How can I work around this? I tried using the PdfStamper[1] from iTextSharp to flatten the PDF, but that didn't work - the resultant stream had the same garbage text.

[1] How to flatten already filled out PDF form using iTextSharp

Bruno Lowagie · Accepted Answer

You are dealing with a PDF that acts as a container for an XML stream. This XML stream is based on the XML Forms Architecture (XFA). The message you see is not garbage! It is the content of the PDF page, which is shown when the document is opened in a viewer that reads the file as if it were an ordinary PDF.

For instance, if you open the document in Apple Preview, you will see the exact same message, because Apple Preview is not able to render an XFA form. It should not surprise you that you get this message when parsing the PDF contained in your file using iText: that is exactly the PDF content that is present in your file. The content you see when opening the document in Adobe Reader isn't stored in PDF syntax; it is stored as an XML stream.
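To illustrate what "stored as an XML stream" means in practice, here is a minimal sketch that reads field values out of XFA-style data XML with the standard Java DOM parser. It assumes you have already obtained the XML as a string; how you get it out of the PDF is outside this sketch. The class name, the method name and the sample field names (`Name`, `City`) are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XfaDataText {

    /** Returns "name: value" for every leaf element that has non-blank text. */
    static List<String> leafValues(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> values = new ArrayList<>();
            collect(doc.getDocumentElement(), values);
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk: only elements without child elements count as fields.
    private static void collect(Node node, List<String> out) {
        boolean hasChildElement = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect(child, out);
            }
        }
        if (!hasChildElement) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(node.getLocalName() + ": " + text);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data; a real form has its own element names.
        String sample =
              "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
            + "<xfa:data><form1><Name>Jane Doe</Name><City>Ghent</City></form1></xfa:data>"
            + "</xfa:datasets>";
        for (String line : leafValues(sample)) {
            System.out.println(line);
        }
    }
}
```

This only recovers the filled-in values, not the labels and layout that Adobe Reader renders from the XFA template, so it is not a substitute for flattening.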

You say that you've tried to flatten the PDF as described in the answer to the question "How to flatten already filled out PDF form using iTextSharp". However, that question is about flattening a form based on AcroForm technology. It is not supposed to work with XFA forms. If you want to flatten an XFA form, you need to use XFA Worker on top of iText:

[JAVA]

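// baos: presumably a ByteArrayOutputStream holding the filled-out XFA PDF; dest: output file path
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
// XFAFlattener comes with XFA Worker
XFAFlattener xfaf = new XFAFlattener(document, writer);
xfaf.flatten(new PdfReader(baos.toByteArray()));
document.close();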

[C#]

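// ms: presumably a MemoryStream holding the filled-out XFA PDF; dest: output file path
Document document = new Document();
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create));
// XFAFlattener comes with XFA Worker
XFAFlattener xfaf = new XFAFlattener(document, writer);
// Rewind the stream so that PdfReader reads from the start
ms.Position = 0;
xfaf.Flatten(new PdfReader(ms));
document.Close();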

The result of this flattening process is an ordinary PDF that can be parsed by your original code.
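If you process a mix of ordinary PDFs and XFA forms, you need some way to decide which files to send through the flattening step first. One rough option, sketched below, is to check whether the extracted text is the placeholder message quoted in the question. This is only a heuristic that assumes the placeholder wording matches that message; the class and method names are hypothetical, and the sketch deliberately uses no iText calls.

```java
final class XfaPlaceholderCheck {

    // Phrases taken from the placeholder message quoted in the question.
    private static final String[] MARKERS = {
        "Please wait...",
        "If this message is not eventually replaced by the proper contents of the document"
    };

    /** Returns true if the text contains all placeholder phrases. */
    static boolean looksLikeXfaPlaceholder(String pageText) {
        if (pageText == null) {
            return false;
        }
        // Collapse whitespace runs so that line wrapping in the extracted text doesn't matter.
        String normalized = pageText.replaceAll("\\s+", " ");
        for (String marker : MARKERS) {
            if (!normalized.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String extracted = "Please wait... \n  \nIf this message is not eventually replaced "
            + "by the proper contents of the\ndocument, your PDF viewer may not be able to "
            + "display this type of document.";
        System.out.println(looksLikeXfaPlaceholder(extracted));
        System.out.println(looksLikeXfaPlaceholder("Invoice 42, total due: 100 EUR"));
    }
}
```

A file flagged this way would be flattened with XFA Worker as shown above and then passed to the original extraction loop.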
