programmersmurf

Reputation: 93

Convert Web Page to PDF or Image

I need to convert a web page [which is not publicly accessible] to a PDF or an image [preferably PNG].

The web page contains a set of charts and images. Most of the charts are populated through Ajax calls, so there is a delay between page load and chart load.

I am looking for an answer to either of these questions:

1- I found a set of snapshot APIs, but none of them supports accessing my internal page. Since the web page I am trying to export is not public, I need to be authenticated. The biggest problem is that I cannot send request headers [such as session-id, cookie or other variables] along with these APIs. It seems they don't support this kind of functionality.

2- I am not sure whether I can do the following: log in to my web page with an HTTP client, add HTTP headers, send a GET request and get the HTML string, then use one of the converters to convert it to PDF. What I am not sure about is whether it's possible to get a proper PDF from the HTML string returned by the HTTP client, since the resources [CSS, JS, etc.] will be missing. I want my PDF/image to look exactly as it does on the web site.

I would really appreciate it if you could help.

Thanks in advance,

ED

Upvotes: 1

Views: 1744

Answers (2)

OnceUponATimeInTheWest

Reputation: 1222

Authentication is difficult because it involves security. Because the operation you are describing is unusual, it is likely to set off all kinds of alarm bells. It is entirely possible to do, but it is fraught, easy to get wrong, and fragile in the face of security updates and code changes.

As such, I'm going to suggest an alternative method, one we often recommend for ABCpdf (on which I work). Yes, we support standard authentication methods, but the beauty of this approach is that it is robust and applies to other solutions (e.g. Java-based ones) and to novel authentication methods.

Typically you just want a PDF of the current page. The easiest way to do this is to snaffle the HTML. How you do this depends on your environment and what you're coding in. For example, under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page.

Then you need to save this HTML to a file and present it to your solution via a 'file://' protocol URL. Obviously, at this point any relative links will be broken, but this is easily fixed by dropping in a BASE tag that references the place where the resources are located.

Generally, the resources referenced by a server-side page are static. So if you can create a BASE tag that references the actual files rather than the web site, you will bypass any authentication for access to these resources.
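As a rough illustration of the save-and-patch step, here is a minimal Python sketch. It assumes you already have the page's HTML as a string; the function names `add_base_tag` and `save_for_conversion` and the base location passed in are hypothetical, not part of ABCpdf or any other converter.

```python
import re
import tempfile
from html import escape
from pathlib import Path


def add_base_tag(page_html: str, base_href: str) -> str:
    """Return page_html with a <base> tag so relative links resolve against base_href."""
    base = f'<base href="{escape(base_href, quote=True)}">'
    # Prefer to insert right after <head>; fall back to right after <html>.
    for tag in ("head", "html"):
        match = re.search(rf"<{tag}\b[^>]*>", page_html, flags=re.IGNORECASE)
        if match:
            return page_html[: match.end()] + base + page_html[match.end():]
    # Neither tag found (HTML fragment): put the <base> tag first.
    return base + page_html


def save_for_conversion(page_html: str, base_href: str) -> str:
    """Write the patched HTML to a temporary file and return its file:// URL."""
    patched = add_base_tag(page_html, base_href)
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(patched)
    return Path(handle.name).resolve().as_uri()
```

The returned URL is what you would hand to the converter. The temporary file is not deleted automatically, so remove it once the conversion has finished.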

That still leaves the AJAX-based problems, which are another can of worms. The render delay method is something we have supported for many years (since before AJAX was around); however, it is not terribly reliable because you just don't know how long to wait.

Much better is a tighter link into the JavaScript, via a callback you can use to determine whether the page has loaded. I don't think ABCpdf is going to be appropriate for you since it is .NET, but I would certainly encourage you to look for a Java-based solution that uses this more sophisticated type of approach.

Upvotes: 0

user1914292

Reputation: 1556

You're probably best off using wkhtmltopdf, which is a server-side tool and is easily installed.

There are two parameters you can use to wait for your Ajax to finish. Try:

  • --javascript-delay to set how long (in milliseconds) the program waits for the JavaScript to finish
  • --window-status to wait until window.status equals a given string before rendering the page

See the extensive manual for this program here

wkhtmltopdf generates a PDF and wkhtmltoimage generates an image, which can be PNG (as you requested), for example by giving the output file a .png extension or passing --format png.
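To show how these options fit together with the authentication problem from the question, here is a small Python sketch that only assembles the command line. It assumes wkhtmltopdf is on the PATH and uses its `--cookie`, `--javascript-delay` and `--window-status` options; the helper name `build_command`, the URL, the cookie name and the status string `charts-ready` are all made up for illustration. For `--window-status` to work, the page itself has to set `window.status` to that string once the last Ajax call has finished.

```python
import subprocess


def build_command(url, output, tool="wkhtmltopdf", cookies=None,
                  delay_ms=None, window_status=None):
    """Assemble an argument list for wkhtmltopdf (or wkhtmltoimage)."""
    cmd = [tool]
    # Pass the session along; wkhtmltopdf expects cookie values URL-encoded.
    for name, value in (cookies or {}).items():
        cmd += ["--cookie", name, value]
    if delay_ms is not None:
        # Fixed wait in milliseconds before rendering.
        cmd += ["--javascript-delay", str(delay_ms)]
    if window_status is not None:
        # Render only once window.status equals this string.
        cmd += ["--window-status", window_status]
    return cmd + [url, output]


def convert(url, output, **options):
    """Run the converter; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(url, output, **options), check=True)
```

A call such as `convert("http://intranet.example/dashboard", "dashboard.pdf", cookies={"JSESSIONID": "abc123"}, window_status="charts-ready")` would then render the page with your session cookie attached. Passing `tool="wkhtmltoimage"` with a `.png` output name should work the same way, but check `wkhtmltoimage --extended-help` to confirm which options your build supports.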

Upvotes: 1
