Error counting PDF pages using VB.NET

Question

I'm having trouble counting the pages of a PDF using VB.NET. My code works and returns the page count for most PDFs, but for certain PDFs it cannot count the pages. Does the PDF need some particular setting?

Below is the sample code I'm using now:

Dim SR As New StreamReader("C:\Users\lee_chun_yong\Desktop\New folder\abc.pdf")
Dim PDFData As String = SR.ReadToEnd
Dim StartIndex As Integer   'Starting index of the Pages Object
Dim EndIndex As Integer     'Ending index of the Pages Object
Dim CountIndex As Int16     'Starting index of "/Count"
Dim chars() As Char = {"/", ">"}
Dim tmp As String
Dim CountEndIndex As Int16  'Index of next "/" after "/Count"
Dim tmpIndex1, tmpIndex2 As Integer
Dim PageCount As Integer
Dim TypePagesIndex As Integer

Do
    'Get an Object of type 'Pages' from PDF file
    'It can be "/Type /Pages" or "/Type/Pages"
    tmpIndex1 = PDFData.IndexOf("/Type /Pages")
    tmpIndex2 = PDFData.IndexOf("/Type/Pages")
    'Different possibilities of 2 indices
    If tmpIndex1 > -1 And tmpIndex1 < tmpIndex2 Then
        TypePagesIndex = tmpIndex1
    ElseIf tmpIndex2 > -1 And tmpIndex2 < tmpIndex1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex1 = -1 And tmpIndex2 > -1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex2 = -1 And tmpIndex1 > -1 Then
        TypePagesIndex = tmpIndex1
    Else  'tmpIndex1 = -1 And tmpIndex2 = -1
        Exit Do
    End If

    tmp = PDFData.Substring(0, TypePagesIndex)
    StartIndex = tmp.LastIndexOf("<<")
    tmp = PDFData.Substring(TypePagesIndex)
    EndIndex = TypePagesIndex + tmp.IndexOf(">>") + 1
    tmp = PDFData.Substring(StartIndex, EndIndex - StartIndex + 1)
    'Now tmp="<< /Kids, /Count etc >>"
    'the pagecount is just after "/Count " in tmp
    CountIndex = tmp.IndexOf("/Count")
    CountIndex += 7  'Move index to the end of "/Count "

    tmp = tmp.Substring(CountIndex)
    'now tmp="Pagecount ....>>"
    'Pagecount is followed by a newline-like char and then "/" or ">>"
    CountEndIndex = tmp.IndexOfAny(chars)
    tmp = tmp.Substring(0, CountEndIndex) 'Get the PageCount
    If PageCount < Val(tmp) Then
        PageCount = Val(tmp)
    End If
    PDFData = PDFData.Substring(EndIndex + 1)
Loop

mkl · Accepted Answer

Your code makes a great many assumptions that need not be true:

- You expect the page tree nodes (especially the page tree root node) to be plainly readable. That need not be the case: these nodes can be stored in object streams, which in turn can be compressed. This may make you miss some or all page tree nodes.
- You expect /Type and /Pages in the page tree nodes to either immediately follow each other or be separated by a single space. This need not be the case: there can be any kind and number of whitespace characters in between, and there may even be a comment. Again, you can miss nodes here.
- You expect the Count value to be a direct integer; it may also be a reference to an indirect object containing that integer. In this case your code takes the object number as the page count.
- You assume /Type /Pages can only occur in page tree nodes currently in use. This is wrong. This sequence of characters can also occur
  - in nodes which are not referenced from the page tree; some PDF processors, while manipulating a PDF, don't delete old objects but merely stop referencing them. If they remove pages, your code will still see the former, higher counts and therefore report a higher page count;
  - in private application data; PDF allows the insertion of private application data, which may contain dictionaries with /Type /Pages and a Count entry whose value has nothing to do with the actual page count;
  - in arbitrary PDF strings; PDFs explaining the structure of PDF files may well contain /Type /Pages in the page content (which in turn may be uncompressed) or in metadata. In that case your code will check a nearby dictionary which is not a page tree node but may still have a Count entry;
  - in embedded files; PDFs can contain embedded files, and if another PDF is embedded in a PDF without further compression, your code treats the page tree nodes of that embedded PDF as if they were page tree nodes of the outer PDF.

There surely are more assumptions in your code, but those above are the ones that come to mind immediately.

I would advise you to use an existing PDF library to retrieve the page count.

If that is not possible, read the PDF the way it is meant to be read: read the trailer or cross-reference stream dictionary to find the catalog, read the catalog to find the page tree root node, and read the Count of that root node. Use the cross-reference tables or streams to locate these objects. In other words, be sure to follow the specification, ISO 32000-1, instead of merely checking some example PDFs.
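As a rough illustration of that lookup chain (startxref, cross-reference table, trailer, catalog, page tree root, Count), here is a minimal sketch. It is not a complete reader: it assumes the file uses classic cross-reference tables (it gives up on cross-reference streams and therefore on object streams), it assumes no comments sit inside the dictionaries it inspects, and it uses regular expressions where a real parser would tokenize. The names PdfPageCountSketch, CountPages and ReadObject are made up for this example.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Public Module PdfPageCountSketch

    ' Returns the text of indirect object objNum, from its offset up to "endobj".
    Private Function ReadObject(pdf As String, offsets As Dictionary(Of Integer, Integer),
                                objNum As Integer) As String
        Dim offset As Integer
        If Not offsets.TryGetValue(objNum, offset) OrElse offset < 0 Then
            Throw New InvalidDataException("Object " & objNum & " is not in use")
        End If
        Dim endPos As Integer = pdf.IndexOf("endobj", offset, StringComparison.Ordinal)
        If endPos < 0 Then Throw New InvalidDataException("endobj not found")
        Return pdf.Substring(offset, endPos - offset)
    End Function

    ' Sketch only: supports classic cross-reference tables, not cross-reference streams.
    Public Function CountPages(pdfBytes As Byte()) As Integer
        ' Latin-1 maps every byte to exactly one Char, so string indices equal byte offsets.
        Dim pdf As String = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes)
        Dim offsets As New Dictionary(Of Integer, Integer)  ' object number -> offset (-1 = free)
        Dim visited As New HashSet(Of Integer)              ' guards against /Prev loops
        Dim rootRef As Integer = -1

        ' The last "startxref" points at the most recent cross-reference section.
        Dim sx As Integer = pdf.LastIndexOf("startxref", StringComparison.Ordinal)
        If sx < 0 Then Throw New InvalidDataException("startxref not found")
        Dim xrefPos As Integer = Integer.Parse(Regex.Match(pdf.Substring(sx + 9), "\d+").Value)

        Do While xrefPos >= 0 AndAlso visited.Add(xrefPos)
            If String.CompareOrdinal(pdf, xrefPos, "xref", 0, 4) <> 0 Then
                Throw New NotSupportedException("Cross-reference streams are not handled here")
            End If
            Dim trailerPos As Integer = pdf.IndexOf("trailer", xrefPos, StringComparison.Ordinal)
            If trailerPos < 0 Then Throw New InvalidDataException("trailer not found")

            ' Table layout: "first count" headers, each followed by count "offset gen n|f" entries.
            Dim table As String = pdf.Substring(xrefPos + 4, trailerPos - xrefPos - 4)
            Dim tokens As MatchCollection = Regex.Matches(table, "\S+")
            Dim i As Integer = 0
            Do While i + 1 < tokens.Count
                Dim first As Integer = Integer.Parse(tokens(i).Value)
                Dim count As Integer = Integer.Parse(tokens(i + 1).Value)
                i += 2
                For k As Integer = 0 To count - 1
                    ' Newer sections are read first, so keep the first entry seen per object.
                    If Not offsets.ContainsKey(first + k) Then
                        offsets(first + k) = If(tokens(i + 2).Value = "n",
                                                Integer.Parse(tokens(i).Value), -1)
                    End If
                    i += 3
                Next
            Loop

            ' The trailer dictionary names the catalog (/Root) and any older section (/Prev).
            Dim trailerEnd As Integer = pdf.IndexOf("startxref", trailerPos, StringComparison.Ordinal)
            If trailerEnd < 0 Then trailerEnd = pdf.Length
            Dim trailer As String = pdf.Substring(trailerPos, trailerEnd - trailerPos)
            Dim rootMatch As Match = Regex.Match(trailer, "/Root\s+(\d+)\s+\d+\s+R")
            If rootRef < 0 AndAlso rootMatch.Success Then
                rootRef = Integer.Parse(rootMatch.Groups(1).Value)
            End If
            Dim prevMatch As Match = Regex.Match(trailer, "/Prev\s+(\d+)")
            xrefPos = If(prevMatch.Success, Integer.Parse(prevMatch.Groups(1).Value), -1)
        Loop
        If rootRef < 0 Then Throw New InvalidDataException("/Root not found in any trailer")

        ' Catalog -> page tree root -> Count.
        Dim catalog As String = ReadObject(pdf, offsets, rootRef)
        Dim pagesMatch As Match = Regex.Match(catalog, "/Pages\s+(\d+)\s+\d+\s+R")
        If Not pagesMatch.Success Then Throw New InvalidDataException("/Pages not found")
        Dim pagesRoot As String = ReadObject(pdf, offsets, Integer.Parse(pagesMatch.Groups(1).Value))
        Dim countMatch As Match = Regex.Match(pagesRoot, "/Count\s+(\d+)(\s+\d+\s+R)?")
        If Not countMatch.Success Then Throw New InvalidDataException("/Count not found")
        Dim value As Integer = Integer.Parse(countMatch.Groups(1).Value)
        If countMatch.Groups(2).Success Then
            ' Count is an indirect reference: the number read so far is an object number.
            Dim body As String = ReadObject(pdf, offsets, value)
            value = Integer.Parse(Regex.Match(body, "obj\s*(\d+)").Groups(1).Value)
        End If
        Return value
    End Function

End Module
```

It could be called as CountPages(File.ReadAllBytes(path)). Reading the file as bytes and decoding it as Latin-1 matters here: the offsets in the cross-reference table are byte offsets, and a StreamReader with its default encoding does not preserve them for binary data. For PDFs with cross-reference streams, compressed object streams or other complications, a PDF library remains the practical choice.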
