Reputation: 45
I am generating a very large PDF (300+ pages) based on user-input HTML. I have this working beautifully thanks to some great samples out there. My next requirement is to generate a dynamic table of contents complete with internal links to the places in the PDF where the chapters start. I have part of that working: I can create internal PDF links that work. The part I need help with is that the page number is unknown. I have tried creating the main PDF first and then scanning through it to find the page number of the text "Chapter one", but that is far too slow given the size of the document and the number of chapters.
Can I detect the current page number while adding to the document? When I am creating the PDF from HTML, I know when I am at a new chapter, but is there a way to ask iTextSharp which page we are currently on so I can use that number in my table of contents? That way I could build the table of contents alongside the main document and merge the two afterward. Are there better ideas out there?
This is how I am generating the PDF from user-input HTML:
Dim document As New Document()
Dim strManualFile As String = "file.pdf"
PdfWriter.GetInstance(document, New FileStream(strManualFile, FileMode.Create, FileAccess.Write, FileShare.ReadWrite))
document.Open()
Dim htmlarraylistBody As List(Of IElement) = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(New StringReader(GetManualHTML()), Nothing)
For l As Integer = 0 To htmlarraylistBody.Count - 1
    document.Add(DirectCast(htmlarraylistBody(l), IElement))
Next
document.Close()
document.Dispose()
Upvotes: 0
Views: 807
Reputation: 55427
PdfWriter.GetInstance() returns an object that you can query to find the current page number, so that's the first thing you should know. If you have control over your HTML I would inject a flag variable that you can watch for in your For loop later. If you find the flag variable, do something, otherwise just add the content as normal.
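As a minimal sketch of that first point, assuming iTextSharp 5.x: keep the reference that PdfWriter.GetInstance() returns and read its PageNumber property between calls to Add. The module name, file name and paragraph text here are placeholders, not part of the original question.

```vbnet
Imports System
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module PageNumberSketch
    Sub Main()
        Dim doc As New Document()
        ''//Keep the writer instead of discarding the return value of GetInstance()
        Dim writer = PdfWriter.GetInstance(doc, New FileStream("sketch.pdf", FileMode.Create))
        doc.Open()
        doc.Add(New Paragraph("Some text on the first page"))
        ''//Ask the writer which page is currently being written
        Console.WriteLine(writer.PageNumber)
        doc.NewPage()
        doc.Add(New Paragraph("Some text on the next page"))
        Console.WriteLine(writer.PageNumber)
        ''//Closing the document also closes the writer and the underlying stream
        doc.Close()
    End Sub
End Module
```

The value reported is the page the writer is on at the moment you ask, so read it at the point where the chapter content is about to be added.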
Just a quick warning, HTMLWorker has been deprecated for a very long time and is not being maintained. All work is instead being done in the XmlWorker library which supports CSS. If you're stuck using an older version because of the license change you should probably read this to find out the myths and facts about the old license.
Below is a full working sample that shows off the flag variable. At the top I create some sample HTML that you'd obviously remove and replace with your real HTML. Then I create a standard document and loop through each item as you did. Inside of the loop I check for the flag variable and if found store it, otherwise add the element just as you did.
This code targets iTextSharp 5.4.4. If you're using an older version of iTextSharp then the Using statements might not work; simply turn them into Dim statements and remove the End Using (or upgrade to the most recent version). See the code for additional comments.
''//File to write to
Dim TestFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf")
''//Create a flag value to search for. We won't write this to the PDF, it is just for searching.
Dim FlagValue = "!!UNIQUE TEXT!!"
''//Build our sample HTML. The real version of this would get the HTML from another source ideally.
Dim sampleHTML = <body/>
For I As Integer = 1 To 10
    ''//Just before inserting our chapter headings we insert our flag value appended with the current chapter number.
    ''//NOTE: This might need to be played with a little bit. I'm not sure if a new page is created by the previous entity
    ''//      closing or the new entity starting.
    sampleHTML.Add(String.Format("{0}{1}", FlagValue, I))
    sampleHTML.Add(<h1><%= String.Format("Chapter {0}", I) %></h1>)
    ''//Add some paragraphs
    For J As Integer = 1 To 100
        sampleHTML.Add(<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                       Suspendisse ac arcu porta, tempor justo eu, tincidunt eros.
                       Integer lorem dolor, pretium sit amet vehicula dapibus,
                       faucibus a tellus.</p>)
    Next
Next
''//This will be our collection of chapter numbers and the actual page numbers that they correspond to.
Dim PageNumbers As New Dictionary(Of String, Integer)
''//Standard PDF setup here, nothing special
Using fs As New FileStream(TestFile, FileMode.Create, FileAccess.Write, FileShare.None)
    Using doc As New Document()
        Using writer = PdfWriter.GetInstance(doc, fs)
            doc.Open()
            ''//Parse our HTML
            Dim htmlarraylistBody = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(New StringReader(sampleHTML.ToString()), Nothing)
            ''//Loop through each item
            For Each Elem In htmlarraylistBody
                ''//Some HTML elements freak the system out so you should check if they are content first.
                If Elem.IsContent() Then
                    ''//If the current element is a paragraph and starts with our flag value
                    If (TypeOf Elem Is Paragraph) AndAlso DirectCast(Elem, Paragraph).Content.StartsWith(FlagValue) Then
                        ''//Add that to our master collection but DO NOT write it to the PDF
                        PageNumbers.Add(DirectCast(Elem, Paragraph).Content.Replace(FlagValue, ""), writer.PageNumber)
                    Else
                        ''//Otherwise just write to the PDF normally
                        doc.Add(Elem)
                    End If
                End If
            Next
            doc.Close()
        End Using
    End Using
End Using
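Once the chapter-to-page collection is filled, one possible next step is to turn it into clickable table of contents entries. The sketch below is only an illustration under assumptions: AppendToc is a hypothetical helper (not part of iTextSharp), it assumes iTextSharp 5.x, it must be called before the document is closed, and it appends the entries on a new page at the end of the same document rather than merging a separate file at the front.

```vbnet
Imports System.Collections.Generic
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module TocSketch
    ''//Hypothetical helper: call it after the loop that fills the collection and before doc.Close()
    Sub AppendToc(doc As Document, writer As PdfWriter, pageNumbers As Dictionary(Of String, Integer))
        ''//Start the table of contents on its own page
        doc.NewPage()
        doc.Add(New Paragraph("Table of Contents"))
        For Each entry In pageNumbers
            ''//entry.Key is the chapter number text, entry.Value is the recorded page number
            Dim link As New Chunk(String.Format("Chapter {0} ..... page {1}", entry.Key, entry.Value))
            ''//Make the text jump to the recorded page when clicked
            link.SetAction(PdfAction.GotoLocalPage(entry.Value, New PdfDestination(PdfDestination.FIT), writer))
            doc.Add(New Paragraph(link))
        Next
    End Sub
End Module
```

If a recorded number turns out to be one page early because the heading itself wrapped onto the next page, the flag may need to be placed after the heading instead, as the note in the sample above suggests.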
Upvotes: 2