Tales Pádua

Reputation: 1461

How can I find the URL that downloads a file?

I am developing a web scraper and need to download a .pdf file from a page. I can get the file name from the HTML tag, but I can't find the complete URL (or request body) that downloads the file.

I have tried sniffing the traffic with the Chrome and Firefox network tools and with Wireshark, with no success. I can see the browser make a POST request to the exact same URL as the page itself, and I can't understand why. My guess is that the file name is sent inside the POST request body, but I can't find that information in those tools either. If I could see the variable name in the body, I could replicate the request and get the file.

How can I get that information?

Here is the website I am talking about: http://www2.trt8.jus.br/consultaprocesso/formulario/ProcessoConjulgado.aspx?sDsTelaOrigem=ListarProcessos.aspx&iNrInstancia=1&sFlTipo=T&iNrProcessoVaraUnica=126&iNrProcessoUnica=1267&iNrProcessoAnoUnica=2010&iNrRegiaoUnica=8&iNrJusticaUnica=5&iNrDigitoUnica=24&iNrProcesso=1267&iNrProcessoAno=2010&iNrProcesso2a=0&iNrProcessoAno2a=0

EDIT: for those seeking to do something similar, take a look at this website: http://curl.trillworks.com/
It converts a cURL command into Python requests code. Very useful.

Upvotes: 3

Views: 7533

Answers (1)

Gideon Pyzer

Reputation: 23998

The POST data used for the request is encoded content generated by ASP.NET. It contains various state/session information of the page that the link is on. This makes it difficult to directly scrape for the URL.

You can examine the request by exporting it as a HAR file from the Network tab in Chrome DevTools.
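Once exported, the HAR file is plain JSON, so the POST body can be inspected with a few lines of Python. The following is a minimal sketch, assuming the standard HAR 1.2 layout (`log.entries[].request.postData`); the helper names `load_har` and `post_entries` and the file name in the usage note are made up for illustration.

```python
import json
from urllib.parse import parse_qsl


def load_har(path):
    """Read a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def post_entries(har):
    """Yield (url, form_fields) for every POST request recorded in the HAR."""
    for entry in har["log"]["entries"]:
        request = entry["request"]
        if request["method"] != "POST":
            continue
        post_data = request.get("postData", {})
        # Form posts are normally listed as name/value pairs under "params".
        fields = {p["name"]: p.get("value", "") for p in post_data.get("params", [])}
        # Fall back to decoding the raw urlencoded body if "params" is absent.
        if not fields and post_data.get("text"):
            fields = dict(parse_qsl(post_data["text"], keep_blank_values=True))
        yield request["url"], fields
```

For example, `for url, fields in post_entries(load_har("page.har")): print(url, sorted(fields))` would list the field names sent with each POST, which is where ASP.NET fields such as `__EVENTVALIDATION` should show up.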

The __EVENTVALIDATION data is used to ensure events raised on the client originate from the controls rendered on the page from the server.

You might be able to achieve what you want by first requesting the page the link is on, extracting the required POST data from the response (which contains the page state and the embedded request for the file), and then making a new request with this information. This assumes the server doesn't expire the session in the meantime.
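That approach could look roughly like the sketch below, which uses only the Python standard library. It assumes the download link triggers a standard ASP.NET WebForms postback, where the control ID goes in `__EVENTTARGET`. That is an assumption about this page, not something confirmed here. The names `HiddenFieldParser`, `build_postback` and `download` are hypothetical, and the control ID would have to be read from the link in the page's HTML.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


class HiddenFieldParser(HTMLParser):
    """Collect every <input type="hidden"> (e.g. __VIEWSTATE, __EVENTVALIDATION)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Build the POST fields: page state plus the control that 'fired'."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    # Assumption: the link uses the usual WebForms postback fields.
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


def download(page_url, event_target):
    """GET the page, then POST its state back to the same URL."""
    # One opener with a cookie jar keeps the session cookie across requests.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    with opener.open(page_url) as page:
        html = page.read().decode("utf-8", errors="replace")
    body = urlencode(build_postback(html, event_target)).encode("ascii")
    with opener.open(Request(page_url, data=body)) as response:
        return response.read()
```

Both requests go through the same opener so that any session cookie set by the first response is sent with the POST. If the server returns the page again instead of the file, the recorded request in the HAR is the place to check for additional fields.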

Upvotes: 2
