Reputation: 25
I'm running into a problem counting the pages of a PDF with VB.NET. My code works for most files and returns the page count, but for certain PDFs it fails to count the pages. Does the PDF need some particular setting for this to work?
Below is the sample code I'm using now:
Dim SR As New StreamReader("C:\Users\lee_chun_yong\Desktop\New folder\abc.pdf")
Dim PDFData As String = SR.ReadToEnd
Dim StartIndex As Integer 'Starting index of the Pages Object
Dim EndIndex As Integer 'Ending index of the Pages Object
Dim CountIndex As Int16 'Starting index of "/Count"
Dim chars() As Char = {"/", ">"}
Dim tmp As String
Dim CountEndIndex As Int16 'Index of next "/" after "/Count"
Dim tmpIndex1, tmpIndex2 As Integer
Dim PageCount As Integer
Dim TypePagesIndex As Integer
Do
    'Get an Object of type 'Pages' from PDF file
    'It can be "/Type /Pages" or "/Type/Pages"
    tmpIndex1 = PDFData.IndexOf("/Type /Pages")
    tmpIndex2 = PDFData.IndexOf("/Type/Pages")
    'Different possibilities of 2 indices
    If tmpIndex1 > -1 And tmpIndex1 < tmpIndex2 Then
        TypePagesIndex = tmpIndex1
    ElseIf tmpIndex2 > -1 And tmpIndex2 < tmpIndex1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex1 = -1 And tmpIndex2 > -1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex2 = -1 And tmpIndex1 > -1 Then
        TypePagesIndex = tmpIndex1
    Else 'tmpIndex1 = -1 And tmpIndex2 = -1
        Exit Do
    End If
    tmp = PDFData.Substring(0, TypePagesIndex)
    StartIndex = tmp.LastIndexOf("<<")
    tmp = PDFData.Substring(TypePagesIndex)
    EndIndex = TypePagesIndex + tmp.IndexOf(">>") + 1
    tmp = PDFData.Substring(StartIndex, EndIndex - StartIndex + 1)
    'Now tmp="<< /Kids, /Count etc >>"
    'the pagecount is just after "/Count " in tmp
    CountIndex = tmp.IndexOf("/Count")
    CountIndex += 7 'Move index to the end of "/Count "
    tmp = tmp.Substring(CountIndex)
    'now tmp="Pagecount ....>>"
    'Pagecount is followed by a newline-like char and then "/" or ">>"
    CountEndIndex = tmp.IndexOfAny(chars)
    tmp = tmp.Substring(0, CountEndIndex) 'Get the PageCount
    If PageCount < Val(tmp) Then
        PageCount = Val(tmp)
    End If
    PDFData = PDFData.Substring(EndIndex + 1)
Loop
Upvotes: 0
Views: 486
Reputation: 95918
Your code makes a great many assumptions that need not be true:
You expect the page tree nodes (especially the page tree root node) to be plainly readable. That need not be the case: these nodes can be stored in object streams, which in turn can be compressed. This may make you miss some or all page tree nodes.
You expect /Type and /Pages in the page tree nodes to either immediately follow each other or be separated by a single space. That need not be the case either: there can be any kind and number of whitespace characters in between, and there may even be a comment! Again, you can miss nodes here.
You expect the Count value to be a direct integer; it may also be a reference to an indirect object containing that integer. In that case your code takes the object number as the page count.
You assume /Type /Pages can only occur in page tree nodes currently in use. This is wrong. This sequence of characters can also occur
There are surely more assumptions in your code, but those above are the ones that come to mind immediately.
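Two of the assumptions listed above, the fixed spacing between /Type and /Pages and the direct integer after /Count, can be illustrated with a small sketch. It is only meant to show how the literal search goes wrong, not to offer a reliable way to count pages: the helper names HasTypePages and DescribeCount are made up for this example, .NET's \s does not cover every character PDF treats as whitespace (NUL, for instance), comments are not handled, and nodes inside compressed object streams remain invisible to any text search.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PageTreeTextPitfalls
    ' Hypothetical helper: True if pdfText contains /Type followed by /Pages
    ' with any run of .NET whitespace (or none) in between.
    Function HasTypePages(pdfText As String) As Boolean
        Return Regex.IsMatch(pdfText, "/Type\s*/Pages\b")
    End Function

    ' Hypothetical helper: tells a direct /Count value ("/Count 12")
    ' apart from an indirect reference ("/Count 12 0 R").
    Function DescribeCount(dictText As String) As String
        Dim m As Match = Regex.Match(dictText, "/Count\s+(\d+)(\s+(\d+)\s+R\b)?")
        If Not m.Success Then Return "no /Count entry found"
        If m.Groups(2).Success Then
            ' The first number is an object number here, not a page count.
            Return "indirect reference to object " & m.Groups(1).Value
        End If
        Return "direct value " & m.Groups(1).Value
    End Function

    Sub Main()
        ' A page tree node with a line break and a tab between /Type and /Pages.
        Dim node As String = "<< /Type" & vbCrLf & vbTab & "/Pages /Count 12 0 R >>"

        Console.WriteLine(node.IndexOf("/Type /Pages", StringComparison.Ordinal)) ' -1
        Console.WriteLine(HasTypePages(node))   ' True
        Console.WriteLine(DescribeCount(node))  ' indirect reference to object 12
        Console.WriteLine(DescribeCount("<< /Type /Pages /Count 3 /Kids [4 0 R] >>")) ' direct value 3
    End Sub
End Module
```

In the first node the code from the question would either miss the node entirely or, with different spacing, report 12 pages even though 12 is only an object number.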
I would advise you to use an existing PDF library to retrieve the page count.
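As one possible illustration of the library route, here is a minimal sketch assuming iTextSharp 5.x is referenced by the project; the choice of library, the helper name GetPageCount, and the file path are assumptions made for this example, and any maintained PDF library that exposes a page count would serve equally well.

```vbnet
Imports System
Imports iTextSharp.text.pdf

Module PageCountWithLibrary
    ' Hypothetical helper: lets the library parse the file and report the page count.
    Function GetPageCount(pdfPath As String) As Integer
        Dim reader As New PdfReader(pdfPath)
        Try
            ' The library resolves cross references, object streams and
            ' indirect /Count values before answering.
            Return reader.NumberOfPages
        Finally
            reader.Close()
        End Try
    End Function

    Sub Main()
        ' Placeholder path; replace it with a real file.
        Console.WriteLine(GetPageCount("C:\temp\example.pdf"))
    End Sub
End Module
```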
If that is not possible, read the PDF the way it is meant to be read: read the trailer or the cross-reference stream dictionary to find the catalog, read the catalog to find the page tree root node, and read the count of that root node. Use the cross-reference streams or tables to locate these objects. In other words: be sure to follow the specification, ISO 32000-1, instead of merely checking some example PDFs.
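To give an idea of where that route starts, the following sketch covers only the very first step: locating the startxref keyword near the end of the file and reading the byte offset of the cross-reference section from it. The helper name FindStartXref, the 2048-byte tail size, and the file path are assumptions for this example. Everything after this step (parsing the cross-reference table or stream, following the trailer's Root entry to the catalog, then Pages, then Count, and handling incremental updates, object streams and encryption) is deliberately left out.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module StartXrefDemo
    ' Hypothetical helper: returns the offset written after the last
    ' "startxref" keyword in the file tail, or -1 if none is found.
    Function FindStartXref(pdfPath As String) As Long
        Using fs As New FileStream(pdfPath, FileMode.Open, FileAccess.Read)
            ' Assumption: the keyword sits within the last 2048 bytes.
            Dim tailLength As Integer = CInt(Math.Min(2048L, fs.Length))
            fs.Seek(-tailLength, SeekOrigin.End)

            Dim buffer(tailLength - 1) As Byte
            Dim bytesRead As Integer = 0
            While bytesRead < tailLength
                Dim n As Integer = fs.Read(buffer, bytesRead, tailLength - bytesRead)
                If n = 0 Then Exit While
                bytesRead += n
            End While

            ' ISO-8859-1 maps every byte to exactly one character,
            ' so binary data in the tail cannot break the decoding.
            Dim tail As String = Encoding.GetEncoding("ISO-8859-1").GetString(buffer, 0, bytesRead)

            Dim idx As Integer = tail.LastIndexOf("startxref", StringComparison.Ordinal)
            If idx < 0 Then Return -1

            Dim m As Match = Regex.Match(tail.Substring(idx), "^startxref\s+(\d+)")
            If m.Success Then Return Long.Parse(m.Groups(1).Value)
            Return -1
        End Using
    End Function

    Sub Main()
        ' Placeholder path; replace it with a real file.
        Console.WriteLine(FindStartXref("C:\temp\example.pdf"))
    End Sub
End Module
```

The file is read as bytes rather than through a StreamReader because the offsets in a PDF are byte offsets, and decoding the whole file as text can shift them.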
Upvotes: 2