Reputation: 4434
I am trying to use a Python web crawler to download about 3000 PDFs from a website. However, the URLs of those PDFs are generated by a JavaScript function. Is there a tutorial or a standard approach for this?
For example, the URL for "Alberto European Hairspray (Aerosol) - All Variants" is generated only after clicking an element whose handler is onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$0');return false;".
So the question is how to get the web crawler to obtain the computed URL.
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
<tbody>
<tr>
<td>
<input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$0');return false;" />
</td>
<td><a href="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','MSDSDetail.aspx?did={0}$0')">Alberto European Hairspray (Aerosol) - All Variants</a>
</td>
<td>Unilever PLC</td>
<td>8131-01</td>
</tr>
<tr class="row-alternate">
<td>
<input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$1');return false;" />
</td>
<td><a href="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','MSDSDetail.aspx?did={0}$1')">Alberto European Mousse (Aerosol) - All Variants</a>
</td>
<td>Unilever PLC</td>
<td>8132-01</td>
</tr>
</tbody>
Upvotes: 1
Views: 421
Reputation: 6277
Another option is to use Selenium to execute the JavaScript and read the computed URLs.
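A minimal sketch of that idea, not a tested solution for this site: it asks the browser to run the page's own `__doPostBack` and then records whatever URL the browser ends up on. The helper names `computed_urls` and `wait_for_url_change`, and the `page_url` parameter, are made up for illustration; `driver` is assumed to be a Selenium WebDriver. If the server streams the PDF back without navigating, the URL will not change and the sketch records `None` for that row.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll=0.2):
    """Poll driver.current_url until it differs from old_url or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.current_url != old_url:
            return driver.current_url
        time.sleep(poll)
    return None  # no navigation observed (e.g. the file was streamed directly)


def computed_urls(driver, page_url, event_target, event_arguments, timeout=10.0):
    """Trigger each postback in the browser and record where it lands.

    `driver` only needs get(), execute_script() and current_url,
    which a Selenium WebDriver provides.
    """
    urls = []
    for argument in event_arguments:
        driver.get(page_url)  # reload so the form is in a fresh state
        old_url = driver.current_url
        # Call the page's own function, exactly as the onclick handler would.
        driver.execute_script(
            "__doPostBack(arguments[0], arguments[1]);", event_target, argument
        )
        urls.append(wait_for_url_change(driver, old_url, timeout=timeout))
    return urls
```

With a real browser this would be called roughly as `computed_urls(webdriver.Chrome(), page_url, 'ctl00$placeBody$gridView$gridView', ['DocumentCenter.aspx?did={0}$0', 'DocumentCenter.aspx?did={0}$1'])` after `from selenium import webdriver`, where `page_url` stands for the listing page, which the question does not give.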
Upvotes: 1
Reputation: 9599
You can't get these URLs from the static HTML alone. Use a JavaScript interpreter (SpiderMonkey, for example) to execute the code, and then go ahead with HTML parsing. Qt's WebKit is also a good approach, but probably slower.
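That said, the `__doPostBack` shown in the question only copies its two arguments into the form fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the form, so for this page it may be possible to reproduce the submission without an interpreter. This is a different technique from running the JavaScript, and only a sketch: it assumes the page is an ordinary ASP.NET WebForms page whose hidden inputs (such as `__VIEWSTATE`) must be sent back unchanged, and that the handlers appear in the HTML with literal single quotes as in the question. `postback_fields` and `HiddenFieldParser` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

# Matches __doPostBack('target','argument') as written in the question's markup.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def postback_fields(html, wanted_prefix="DocumentCenter.aspx"):
    """Return one dict of form fields per PDF postback found in the page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    results = []
    for target, argument in POSTBACK_RE.findall(html):
        if not argument.startswith(wanted_prefix):
            continue  # skip the MSDSDetail.aspx links, keep the PDF icons
        fields = dict(parser.fields)  # hidden state the server expects back
        fields["__EVENTTARGET"] = target      # what __doPostBack sets
        fields["__EVENTARGUMENT"] = argument  # what __doPostBack sets
        results.append(fields)
    return results
```

Each returned dict could then be sent as a POST body to the page's own URL, for example with `urllib.request.urlopen(page_url, data=urllib.parse.urlencode(fields).encode("ascii"))`. Whether the server answers with the PDF itself or with a redirect, and whether it also needs session cookies, is not known from the question.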
Upvotes: 1