Dennis

Reputation: 3678

iTextSharp exception "Stack empty" when getting text from a PDF page

I am trying to loop through each page of a PDF to look for specific keywords. The code works fine on other PDFs but fails on this one.

My code:

Using oReader As New pdf.PdfReader(pdfFilename)

    For pCurrent = oReader.NumberOfPages To 1 Step -1
        Dim strategy As pdf.parser.ITextExtractionStrategy = New pdf.parser.SimpleTextExtractionStrategy()
        Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

        '
        'search for keywords
        '
        'FindVOI

    Next 'proceed next page

End Using

This is the line that throws the exception:

Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

It throws a "Stack empty" exception at page 98 of this PDF. Any ideas what is wrong?

Full Exception:

Exception thrown: 'System.InvalidOperationException' in System.dll
System.Transactions Critical: 0 : <TraceRecord xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord" Severity="Critical"><TraceIdentifier>http://msdn.microsoft.com/TraceCodes/System/ActivityTracing/2004/07/Reliability/Exception/Unhandled</TraceIdentifier><Description>Unhandled exception</Description><AppDomain>VipMonitorService.vshost.exe</AppDomain><Exception><ExceptionType>System.InvalidOperationException, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType><Message>Stack empty.</Message><StackTrace>   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</StackTrace><ExceptionString>System.InvalidOperationException: Stack empty.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</ExceptionString></Exception></TraceRecord>

Upvotes: 2

Views: 1085

Answers (1)

mkl

Reputation: 95963

It throws a "Stack empty" exception at page 98 of this PDF. Any ideas what is wrong?

The stack trace shows that the "Stack empty" exception is thrown in iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke. So we should look at the operators that begin and end marked content:

Operands: tag
Operator: BMC
Description: Begin a marked-content sequence terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence.

Operands: tag properties
Operator: BDC
Description: Begin a marked-content sequence with an associated property list, terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence. properties shall be either an inline dictionary containing the property list or a name object associated with it in the Properties subdictionary of the current resource dictionary (see 14.6.2, “Property Lists”).

Operands: (none)
Operator: EMC
Description: End a marked-content sequence begun by a BMC or BDC operator.

(Table 320 – Marked-content operators, ISO 32000-1)

If you look at the operators that start (BMC/BDC) and end (EMC) marked content on the page in question, you'll see:

/Artifact <</O /Layout >>BDC
EMC 
/Artifact <</O /Layout >>BDC  
EMC  
/Artifact <</O /Layout >>BDC  
EMC 
/Artifact <</BBox [0 33.8887 407.4289 0 ]/O /Layout >>BDC  
EMC 
EMC
...

The last EMC is surplus: every preceding BDC has already been closed, so there is no open BMC or BDC sequence left for it to end.

Thus, this document is not a valid PDF; in particular, its marked content structure is broken.
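
To make the imbalance concrete, here is a minimal sketch that replays only the operator names from the excerpt above (operands omitted) against a stack, the same way a content stream processor would. The module and function names (MarkedContentCheck, FindSurplusEmc) are made up for this illustration and are not part of iTextSharp.

```vbnet
Imports System
Imports System.Collections.Generic

Module MarkedContentCheck

    ' Returns the zero-based positions of EMC operators that have no open
    ' BMC/BDC sequence to close. An empty list means no surplus EMC was found.
    Function FindSurplusEmc(operatorNames As IEnumerable(Of String)) As List(Of Integer)
        Dim openSequences As New Stack(Of String)()
        Dim surplus As New List(Of Integer)()
        Dim position As Integer = 0

        For Each name As String In operatorNames
            Select Case name
                Case "BMC", "BDC"
                    openSequences.Push(name)
                Case "EMC"
                    If openSequences.Count > 0 Then
                        openSequences.Pop()
                    Else
                        ' An unguarded Pop() here is what raises "Stack empty".
                        surplus.Add(position)
                    End If
            End Select
            position += 1
        Next

        Return surplus
    End Function

    Sub Main()
        ' Operator names taken from the excerpt of the page in question.
        Dim names As String() = {"BDC", "EMC", "BDC", "EMC", "BDC", "EMC", "BDC", "EMC", "EMC"}
        Dim surplus As List(Of Integer) = FindSurplusEmc(names)
        Console.WriteLine("Surplus EMC operators: " & surplus.Count.ToString())
    End Sub

End Module
```

For the nine operators in the excerpt this reports one surplus EMC, the last one.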


That being said, it would be appropriate for iTextSharp to check the stack before the Pop and then either throw a more meaningful exception or ignore the EMC operator.
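
As long as the library does not do that check itself, the calling code can at least keep its page loop running. The following is only a sketch under assumptions: TryGetPageText and PdfTextHelper are hypothetical names, and it assumes iTextSharp 5.x with the same GetTextFromPage call as in the question. The text of the broken page is still lost, because the exception aborts processing of that page.

```vbnet
Imports System
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module PdfTextHelper

    ' Hypothetical helper: returns Nothing when the page's content stream
    ' cannot be processed, instead of letting the exception end the loop.
    Function TryGetPageText(reader As PdfReader, pageNumber As Integer) As String
        Try
            Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
            Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
        Catch ex As InvalidOperationException
            ' "Stack empty" surfaces here when an EMC has no matching BMC/BDC.
            Console.Error.WriteLine(String.Format("Page {0} skipped: {1}", pageNumber, ex.Message))
            Return Nothing
        End Try
    End Function

End Module
```

In the loop from the question, the GetTextFromPage line would become Dim pageText As String = TryGetPageText(oReader, pCurrent), followed by a check for Nothing before searching for keywords.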

Upvotes: 3
