Reputation: 455
Scenario:
We are required to enter data daily into a government database in a European country, and we suddenly need to retrieve some of that data. But the only format they will allow is PDFs generated from the data, hundreds of them. We would like to avoid sitting in front of a web browser clicking link after link.
The generated links look like this:
<a href='javascript:viajeros("174814255")'>
    <img src="img/pdf.png">
</a>
I have almost no experience with JavaScript, so I don't know whether I can install a routine as a bookmarklet to loop through the DOM, find all the links, and call the function. Nor, if that's possible, how to write it.
The ID numbers can't be predicted, so I can't write another page or curl/wget script to do it. (And if I could, it would still fail as mentioned below.)
The 'viajeros' function is simple:
function viajeros(id) {
    var idm = document.forms[0].idioma.value;
    window.open("parteViajeros.do?lang=" + idm + "&id_fichero=" + id);
}
but feeding that URI to curl or wget fails. Apparently they check either a cookie or the Referer header and generate an error.
Besides, with each link putting the PDF in a browser tab instead of in the downloads directory, we would still have to do two clicks (tab and save) hundreds of times.
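In other words, each link boils down to an id plus a fixed URL pattern. A minimal sketch of that string handling (the helper names extractId and buildPdfUrl are mine, not the site's):

```javascript
// Hypothetical helpers, mirroring what viajeros() does with each link.

// Extract the numeric id from an href like: javascript:viajeros("174814255")
function extractId(href) {
  var m = href.match(/viajeros\("(\d+)"\)/);
  return m ? m[1] : null;
}

// Rebuild the URL that viajeros() opens, without the window.open call.
function buildPdfUrl(lang, id) {
  return "parteViajeros.do?lang=" + lang + "&id_fichero=" + id;
}
```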
What should I do instead?
For what it's worth, this is on macOS 10.13.4. I normally use Safari, but I also have Opera and Firefox available. I could install Chrome, but that's the last resort. No, that's second to last: we also have a (shudder) Windows 10 laptop. THAT'S last resort.
(Note: I looked at the four suggested duplicates that seemed promising, but each either had no answer or instructed the asker to modify the code that generates the PDF.)
Upvotes: 1
Views: 919
Reputation: 5874
document.querySelectorAll("img[src=\"img/pdf.png\"]")
  .forEach((el, i) => {
    let id = el.parentElement.href.split("\"")[1];
    let url = "parteViajeros.do?lang=" + document.forms[0].idioma.value +
      "&id_fichero=" + id;
    setTimeout(() => {
      downloadURI(url, id);
    }, 1500 * i);
  });
This gets all of the images of the PDF icon, then looks at each one's parent for the link target. The ID is extracted from that href and used to build the path to the file to be downloaded, the same URL 'viajeros' constructs but without the window.open. Each URL is then passed to downloadURI, which performs the download; the calls are staggered with setTimeout so the downloads don't all fire at once.
This uses the downloadURI function from another Stack Overflow answer: you can download a URL by creating a link with the download attribute set and clicking it programmatically, which is implemented as follows. Note this is only tested in Chrome.
function downloadURI(uri, name) {
    var link = document.createElement("a");
    link.download = name;
    link.href = uri;
    document.body.appendChild(link);
    link.click();
    document.body.removeChild(link);
}
Open the page with the links and open the console. Paste the downloadURI function first, then run the code above to download all the links.
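If the download-attribute approach misbehaves outside Chrome, a fetch-based variant of the same idea should also work from the page's console, since a same-origin fetch sends the session cookies automatically. A sketch (the names downloadViaFetch and pdfName are mine):

```javascript
// Sketch: fetch the PDF bytes, then save them via a temporary object URL.
// Run from the console of the logged-in page so cookies are included.
function downloadViaFetch(url, name) {
  return fetch(url)
    .then(function (res) { return res.blob(); })
    .then(function (blob) {
      var link = document.createElement("a");
      link.href = URL.createObjectURL(blob);
      link.download = name;
      document.body.appendChild(link);
      link.click();
      document.body.removeChild(link);
      URL.revokeObjectURL(link.href);
    });
}

// Helper to name each saved file after its id.
function pdfName(id) {
  return id + ".pdf";
}
```

Used the same way as downloadURI above, e.g. downloadViaFetch(url, pdfName(id)) inside the loop.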
Upvotes: 1
Reputation: 4885
I had a similar situation, where I had to download all the (invoice) PDFs generated in a day or the past week.
After some research I was able to do the scraping using PhantomJS, and later I discovered CasperJS, which made the job easier.
PhantomJS and CasperJS are headless browsers.
Since you have little experience with JavaScript, and if you are a C# developer, CefSharp may help you.
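The CasperJS route can be sketched roughly like this (the listing URL, selector, and the helper targetName are placeholders of mine; casper.download saves a URL to disk and should carry the session's cookies):

```javascript
// Sketch of a CasperJS script: run with `casperjs download_pdfs.js`.
// The `phantom` global only exists inside the PhantomJS/CasperJS runtime.

// Helper to name each saved file after its id.
function targetName(id) {
  return id + ".pdf";
}

if (typeof phantom !== "undefined") {
  var casper = require("casper").create();

  // Placeholder URL; the real listing page would go here.
  casper.start("https://example.gov/listado.do", function () {
    // Pull the ids out of the page, same link pattern as in the question.
    var ids = this.evaluate(function () {
      var links = document.querySelectorAll('a[href^="javascript:viajeros"]');
      return Array.prototype.map.call(links, function (a) {
        return a.href.split('"')[1];
      });
    });
    var lang = this.evaluate(function () {
      return document.forms[0].idioma.value;
    });
    ids.forEach(function (id) {
      // casper.download fetches the URL and writes the response to disk.
      this.download("https://example.gov/parteViajeros.do?lang=" + lang +
                    "&id_fichero=" + id, targetName(id));
    }, this);
  });

  casper.run();
}
```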
Some useful links:
- Getting started with PhantomJS, CasperJS, and CefSharp
- The documentation of each on downloading files
Upvotes: 1