I am using chromedp to download PDF files. I can handle complicated pages where the PDF loads inside an iframe, using code similar to the download_file example: I detect the iframe, load the iframe URL separately, wait for the pdfViewer, and then click the #download button. Example of a working URL: https://www.sebi.gov.in/filings/public-issues/sep-2021/tamilnad-mercantile-bank-limited_52434.html
However, I am NOT able to download in the simple cases below, where I have a direct link to the PDF. The code in the download_file example just loads the document and does not trigger a download, so I tried the code below, which reads the response body directly as in the download_image example. When I open these URLs in google-chrome they work fine; I guess they get loaded in Chrome's default PDF viewer extension.
I have tried multiple chromedp and cdproto versions; two of the combinations are below:
github.com/chromedp/cdproto v0.0.0-20240721024200-dac8efcb39ce
github.com/chromedp/chromedp v0.9.5
And
github.com/chromedp/cdproto v0.0.0-20240801214329-3f85d328b335
github.com/chromedp/chromedp v0.10.0
I also tried printing the page to PDF, using code similar to this example, but that produces a blank PDF.
https://www.bseindia.com/bseplus/AnnualReport/543258/74183543258.pdf
This one actually downloads an HTML file rather than the original PDF. I checked this by running cat on download.pdf:
cat download.pdf
<!doctype html><html><body style='height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);'><embed name='91302F098E174F9DE7C97CF2F96C4F5E' style='position:absolute; left: 0; top: 0;'width='100%' height='100%' src='about:blank' type='application/pdf' internalid='91302F098E174F9DE7C97CF2F96C4F5E'></body></html>%
I am able to correctly download this with curl.
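The manual cat check above can also be done programmatically. The following is a minimal sketch, not part of my actual program: classifyPayload is a hypothetical helper name, and the two heuristics (a leading %PDF- signature for a real PDF, an embed tag with type application/pdf for Chrome's viewer wrapper) are assumptions based on the output shown above.

```go
package main

import (
	"bytes"
	"fmt"
)

// classifyPayload is a hypothetical helper that guesses what a downloaded
// byte slice contains:
//   - "pdf"             if it starts with the %PDF- signature
//   - "pdf-viewer-html" if it looks like the HTML wrapper page that embeds
//     Chrome's built-in PDF viewer (as in the cat output above)
//   - "unknown"         otherwise
func classifyPayload(buf []byte) string {
	if bytes.HasPrefix(buf, []byte("%PDF-")) {
		return "pdf"
	}
	lower := bytes.ToLower(buf)
	if bytes.Contains(lower, []byte("<embed")) &&
		bytes.Contains(lower, []byte("application/pdf")) {
		return "pdf-viewer-html"
	}
	return "unknown"
}

func main() {
	// In-memory samples; in practice buf would be the bytes returned by
	// network.GetResponseBody or read back from download.pdf.
	realPDF := []byte("%PDF-1.7\n1 0 obj\n")
	wrapper := []byte("<!doctype html><html><body><embed type='application/pdf' src='about:blank'></body></html>")

	fmt.Println(classifyPayload(realPDF))
	fmt.Println(classifyPayload(wrapper))
}
```

Calling such a check before os.WriteFile would at least make the program fail loudly instead of silently saving the viewer page as download.pdf.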
https://nsearchives.nseindia.com/content/equities/IPO_RHP_UNICOMM.pdf
Error: page load error net::ERR_HTTP2_PROTOCOL_ERROR
I cannot download this one even with curl, which gives the same error. I have asked a separate Stack Overflow question about that.
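One way to test whether this failure is specific to HTTP/2 is to retry the request over HTTP/1.1 only (the curl equivalent is the --http1.1 flag). Below is a minimal diagnostic sketch using Go's net/http, which documents that a non-nil, empty Transport.TLSNextProto map disables HTTP/2. The helper names http1OnlyTransport and newHTTP1Client are hypothetical, and it is only an assumption that the protocol version matters here; the server may be rejecting the request for other reasons, such as missing headers.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// http1OnlyTransport returns a transport that never negotiates HTTP/2.
// Per the net/http docs, a non-nil, empty TLSNextProto map disables HTTP/2.
func http1OnlyTransport() *http.Transport {
	return &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
}

// newHTTP1Client wraps the transport in a client with an overall timeout.
func newHTTP1Client(timeout time.Duration) *http.Client {
	return &http.Client{Transport: http1OnlyTransport(), Timeout: timeout}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: http1check <url>")
		return
	}

	resp, err := newHTTP1Client(60 * time.Second).Get(os.Args[1])
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	// Drain the body to count the bytes without keeping them in memory.
	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(resp.Proto, resp.Status, n, "bytes")
}
```

If the request succeeds over HTTP/1.1, that narrows the problem to HTTP/2 handling; if it still fails, the protocol version is probably not the cause.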
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/chromedp/cdproto/network"
	"github.com/chromedp/chromedp"
)

func main() {
	url1 := "https://www.bseindia.com/bseplus/AnnualReport/543258/74183543258.pdf"
	//url2 := "https://nsearchives.nseindia.com/content/equities/IPO_RHP_UNICOMM.pdf"
	Chromepd_download(url1)
}

func Chromepd_download(urlstr string) {
	ctx, cancel := chromedp.NewContext(
		context.Background(),
		chromedp.WithLogf(log.Printf),
		chromedp.WithDebugf(log.Printf),
	)
	defer cancel()

	// create a timeout as a safety net to prevent any infinite wait loops
	ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	// set up a channel, so we can block later while we monitor the download
	// progress
	done := make(chan bool)

	// remember the request id of the PDF URL so its body can be fetched later
	var requestID network.RequestID
	chromedp.ListenTarget(ctx, func(v interface{}) {
		switch ev := v.(type) {
		case *network.EventRequestWillBeSent:
			log.Printf("EventRequestWillBeSent: %v: %v", ev.RequestID, ev.Request.URL)
			if ev.Request.URL == urlstr {
				requestID = ev.RequestID
			}
		case *network.EventLoadingFinished:
			log.Printf("EventLoadingFinished: %v", ev.RequestID)
			if ev.RequestID == requestID {
				close(done)
			}
		}
	})

	// all we need to do here is navigate to the download url
	if err := chromedp.Run(ctx,
		chromedp.Navigate(urlstr),
	); err != nil {
		log.Fatal(err)
	}

	// This will block until the chromedp listener closes the channel
	<-done

	// get the downloaded bytes for the request id
	var buf []byte
	if err := chromedp.Run(ctx, chromedp.ActionFunc(func(ctx context.Context) error {
		var err error
		buf, err = network.GetResponseBody(requestID).Do(ctx)
		return err
	})); err != nil {
		log.Fatal(err)
	}

	// write the file to disk - since we hold the bytes we dictate the name and
	// location
	if err := os.WriteFile("download.pdf", buf, 0644); err != nil {
		log.Fatal(err)
	}
	log.Print("wrote download.pdf")
}