Lee

Reputation: 11

iTextSharp GetTextFromPage does not return

This pertains to iTextSharp 5.5.8 and 5.5.9. My test harness is:

{
  PdfReader pdfReader = null;
  StringBuilder actual = new StringBuilder();

  try
  {
    pdfReader = new PdfReader(@"Quotation for Macbook 6-16.pdf");
  }
  catch (iTextSharp.text.exceptions.BadPasswordException bpe)
  {
    actual.AppendLine(string.Format("Exception: Bad Password {0}", bpe));
  }
  catch (Exception ex)
  {
    actual.AppendLine(string.Format("Exception: PDFReader {0}", ex));
  }

  int pages = pdfReader.NumberOfPages;
  for (int page = 1; page <= pages; page++)
  {
    try
    {
      String s = PdfTextExtractor.GetTextFromPage(pdfReader, page);
      actual.AppendLine(string.Format("{0}", s));
    }
    catch (Exception ex)
    {
      actual.AppendLine(string.Format("Exception PDF Page {0}: {1}", page, ex));
    }
  }

  foreach (var field in pdfReader.AcroFields.Fields)
  {
    actual.AppendLine(string.Format("{0}: {1}", field.Key, pdfReader.AcroFields.GetField(field.Key)));
  }
}

I have processed thousands of PDF files by calling GetTextFromPage, but I encountered one particular PDF for which the call never returns. I downloaded the code from GitHub and stepped through it while it processed the file. It looks like the values passed to InitFirst in LineDashPattern cause an infinite loop. Here is the code from LineDashPattern.cs:

        private void InitFirst(float phase) {
            if (dashArray.Size > 0) {
                while (phase > 0) {
                    phase -= dashArray.GetAsNumber(currentIndex).FloatValue;
                    currentIndex = (currentIndex + 1) % DashArray.Size;
                    elemOrdinalNumber++;
                }

                if (phase < 0) {
                    --elemOrdinalNumber;
                    --currentIndex;
                    currentElem = new DashArrayElem(-phase, IsEven(elemOrdinalNumber));
                } else {
                    currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                        IsEven(elemOrdinalNumber));
                }
            }
        }

The phase that is passed in is 6.44245E+8, and there are two entries in the dashArray: 28.8 and 9.6. With such a large phase, the first while loop gets stuck, because at float's resolution subtracting 28.8 is not significant enough to decrease the phase.
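The stall can be reproduced outside iTextSharp. The following is a minimal standalone sketch, not library code; the class and method names are made up for the illustration. It assumes the phase is stored as an IEEE 754 single-precision float, which is what the `float phase` parameter implies.

```csharp
using System;

// Hypothetical demo class; nothing here is part of iTextSharp.
public static class FloatAbsorptionDemo
{
    // Subtracts step from phase and rounds the result to single precision.
    // The explicit cast forces float rounding even if the runtime evaluates
    // the subtraction at a higher precision internally.
    public static float Subtract(float phase, float step)
    {
        return (float)(phase - step);
    }

    // Distance from a positive, finite value to the next larger float,
    // found by incrementing the raw bit pattern by one.
    public static float Spacing(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        float next = BitConverter.ToSingle(BitConverter.GetBytes(bits + 1), 0);
        return (float)(next - value);
    }
}
```

For a phase of 6.44245E+8f, Spacing reports 64, so neighbouring float values are 64 apart. Subtracting 28.8 (or 9.6) moves less than half that distance, so the result rounds back to the original phase and the `while (phase > 0)` loop never makes progress.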

I do not know enough about the internals or I would consider making changes.

I am really only interested in extracting the text, so if there is a setting I can implement to filter out the line processing that would work for me too.

Upvotes: 0

Views: 773

Answers (1)

Lee

Reputation: 11

I updated the LineDashPattern.cs file. I am using iTextSharp, and as far as I know 5.5.9 is the latest release, so iText 7 might be Java only.

Anyhow, here is the code that I updated. I added elts (the sum of the dash array elements) as a private field in the class, updated the DashArray property setter to recompute elts from the current dashArray, and finally updated the InitFirst method to divide the phase by elts. That does the bulk of the computation in one statement, then falls into the original code to find the actual element.

I do not know what phase values are typically passed into the routine, but with my value, even if the loop had been able to reduce the phase, it would have run nearly 17 million times. This change should therefore be significantly faster, and since the method was called multiple times for this PDF, the performance improvement is even greater, in addition to fixing the bug. The full file is below:
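As a standalone illustration of the idea (a sketch only, separate from the patched file below): reduce the phase modulo the pattern length first, then walk at most one pass through the array. The class and method names are hypothetical. The sketch assumes a non-negative phase, works in double rather than float, and does not track whether the element is a dash or a gap.

```csharp
using System;

// Hypothetical helper; not part of iTextSharp.
public static class DashPhaseSketch
{
    // Finds the dash array element in which the phase lands and how much of
    // that element is still left. Leaves index = 0 and remaining = 0 when the
    // array is empty or its elements sum to zero.
    public static void Locate(double phase, double[] dashArray,
                              out int index, out double remaining)
    {
        index = 0;
        remaining = 0;

        double period = 0;
        foreach (double length in dashArray)
        {
            period += length;
        }
        if (dashArray.Length == 0 || period <= 0)
        {
            return;
        }

        // Skip all whole repetitions of the pattern in one step.
        double offset = Math.Max(0.0, phase) % period;

        // Walk the remaining distance, which is now less than one period.
        while (offset >= dashArray[index])
        {
            offset -= dashArray[index];
            index = (index + 1) % dashArray.Length;
        }
        remaining = dashArray[index] - offset;
    }
}
```

With the pattern {3, 2} and a phase of 1000000004, the reduced offset is 4, which lands 1 unit into the second element, so the loop body runs once instead of hundreds of millions of times.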

/*
 * $Id$
 *
 * This file is part of the iText (R) project.
 * Copyright (c) 1998-2016 iText Group NV
 * Authors: Bruno Lowagie, Paulo Soares, et al.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License version 3
 * as published by the Free Software Foundation with the addition of the
 * following permission added to Section 15 as permitted in Section 7(a):
 * FOR ANY PART OF THE COVERED WORK IN WHICH THE COPYRIGHT IS OWNED BY
 * ITEXT GROUP. ITEXT GROUP DISCLAIMS THE WARRANTY OF NON INFRINGEMENT
 * OF THIRD PARTY RIGHTS
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU Affero General Public License for more details.
 * You should have received a copy of the GNU Affero General Public License
 * along with this program; if not, see http://www.gnu.org/licenses or write to
 * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
 * Boston, MA, 02110-1301 USA, or download the license from the following URL:
 * http://itextpdf.com/terms-of-use/
 *
 * The interactive user interfaces in modified source and object code versions
 * of this program must display Appropriate Legal Notices, as required under
 * Section 5 of the GNU Affero General Public License.
 *
 * In accordance with Section 7(b) of the GNU Affero General Public License,
 * a covered work must retain the producer line in every PDF that is created
 * or manipulated using iText.
 *
 * You can be released from the requirements of the license by purchasing
 * a commercial license. Buying such a license is mandatory as soon as you
 * develop commercial activities involving the iText software without
 * disclosing the source code of your own applications.
 * These activities include: offering paid services to customers as an ASP,
 * serving PDFs on the fly in a web application, shipping iText with a closed
 * source product.
 *
 * For more information, please contact iText Software Corp. at this
 * address: [email protected]
 */

using System.util;
using iTextSharp.awt.geom;

namespace iTextSharp.text.pdf.parser {

    /**
     * Represents the line dash pattern. The line dash pattern shall control the pattern
     * of dashes and gaps used to stroke paths. It shall be specified by a dash array and
     * a dash phase.
     *
     * @since 5.5.6
     */
    public class LineDashPattern {

        private PdfArray dashArray;
        private float dashPhase;

        private int currentIndex;
        private int elemOrdinalNumber = 1;
        private DashArrayElem currentElem;
        private float elts = 0.0F;

        /**
         * Creates new {@link LineDashPattern} object.
         * @param dashArray The dash array. See {@link #getDashArray()}
         * @param dashPhase The dash phase. See {@link #getDashPhase()}
         */
        public LineDashPattern(PdfArray dashArray, float dashPhase) {
            // Assign through the property so that elts is computed before InitFirst runs.
            DashArray = new PdfArray(dashArray);
            this.dashPhase = dashPhase;
            InitFirst(dashPhase);
        }

        /**
         * Getter and setter for the dash array.
         *
         * The dash array's elements are numbers that specify the lengths of
         * alternating dashes and gaps; the numbers are nonnegative. The
         * elements are expressed in user space units.
         *
         * @return The dash array.
         */
        public PdfArray DashArray {
            get { return dashArray; }
            set 
            { 
              dashArray = value;
              // Reset the elts field itself; declaring a local "float elts" here
              // would shadow the field and leave it at 0.
              elts = 0.0F;
              for (int i = 0; i < dashArray.Size; i++)
              {
                elts += dashArray.GetAsNumber(i).FloatValue;
              }
            }
        }

        /**
         * Getter and setter for the dash phase.
         *
         * The dash phase shall specify the distance into the dash pattern at which
         * to start the dash. The elements are expressed in user space units.
         *
         * @return The dash phase.
         */
        public float DashPhase {
            get { return dashPhase; }
            set { dashPhase = value; }
        }

        /**
         * Calculates and returns the next element which is either gap or dash.
         * @return The next dash array's element.
         */
        public DashArrayElem Next() {
            DashArrayElem ret = currentElem;

            if (dashArray.Size > 0) {
                currentIndex = (currentIndex + 1) % DashArray.Size;
                currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                    IsEven(++elemOrdinalNumber));
            }

            return ret;
        }

        /**
         * Checks whether the dashed pattern is solid or not. It's solid when the
         * size of a dash array is even and sum of all the units off in the array
         * is 0.<br/>
         * For example: [3 0 4 0 5 0 6 0] (sum is 0), [3 0 4 0 5 1] (sum is 1).
         */
        public bool IsSolid() {
            if (dashArray.Size % 2 != 0) {
                return false;
            }

            float unitsOffSum = 0;

            for (int i = 1; i < dashArray.Size; i += 2) {
                unitsOffSum += dashArray.GetAsNumber(i).FloatValue;
            }

            return Util.Compare(unitsOffSum, 0) == 0;
        }

        /**
         * Resets the dash array so that the {@link #next()} method will start
         * from the beginning of the dash array.
         */
        public void Reset() {
            currentIndex = 0;
            elemOrdinalNumber = 1;
            InitFirst(dashPhase);
        }

        private void InitFirst(float phase) {
            if (dashArray.Size > 0) {
              // handle the bulk of the line pattern
              //
              if (elts > 0.0)
              {
                int occurrences = (int)(phase / elts);
                // add to the ordinal number rather than overwrite it, so the
                // starting value of 1 (and with it the dash/gap parity) is kept
                elemOrdinalNumber += occurrences * dashArray.Size;
                // take the remainder instead of subtracting occurrences * elts:
                // the rounded float product can overshoot and make the phase negative
                phase %= elts;

                // adjust for the final set of pattern elements
                //
                while (phase > 0)
                {
                  phase -= dashArray.GetAsNumber(currentIndex).FloatValue;
                  currentIndex = (currentIndex + 1) % DashArray.Size;
                  elemOrdinalNumber++;
                }

                if (phase < 0)
                {
                  --elemOrdinalNumber;
                  --currentIndex;
                  currentElem = new DashArrayElem(-phase, IsEven(elemOrdinalNumber));
                }
                else
                {
                  currentElem = new DashArrayElem(dashArray.GetAsNumber(currentIndex).FloatValue,
                      IsEven(elemOrdinalNumber));
                }
              }
            }
        }

        private bool IsEven(int num) {
            return (num % 2) == 0;
        }

        public class DashArrayElem {

            private float val;
            private bool isGap;

            public DashArrayElem(float val, bool isGap) {
                this.val = val;
                this.isGap = isGap;
            }

            public float Value
            {
                get { return val; }
                set { val = value; }
            }

            public bool IsGap
            {
                get { return isGap; }
                set { isGap = value; }
            }
        }
    }
}

Upvotes: 1
