user9432770

Extract embedded pdf

I noticed that docplayer.net embeds many PDFs. Example: http://docplayer.net/72489212-Excellence-in-prevention-descriptions-of-the-prevention-programs-and-strategies-with-the-greatest-evidence-of-success.html

How can I extract these PDFs (i.e. download them) with an automated workflow?

Upvotes: 5

Views: 12854

Answers (3)

Gialli

Reputation: 11

Open the developer tools, go to the Network tab, right-click the PDF request, and select "Copy > Copy as PowerShell". Paste the copied command into PowerShell and add -OutFile "C:\pdf.pdf" at the end to save the response to a file.

Upvotes: 1

Paul Brannan

Reputation: 1713

As you pointed out, grabbing the URL alone results in a 403 Forbidden. There are two headers you also need, "s" and "ex".

To get these using Firefox, open the Network tab in the inspector, right-click the PDF request, and select "Copy > Copy as cURL". The resulting curl command reproduces the exact request the browser made to fetch the resource. In addition to the "s" and "ex" headers, you will notice a "Range" header. Remove it unless you only want to download part of the file. The remaining headers are not relevant.

I will not post the resulting direct link to the PDF here, but I did test it and was able to download the entire file with this technique.
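The same request can also be replayed from a script. The following is only a minimal sketch using Python's standard library, not something tested against the site: the function names are made up for illustration, and the "s" and "ex" values must be copied from your own browser session (the values in the usage note are placeholders).

```python
import urllib.request


def build_pdf_request(pdf_url, copied_headers):
    """Build a request that replays the copied browser headers, minus Range."""
    headers = {
        name: value
        for name, value in copied_headers.items()
        # Drop Range so the server returns the whole file, not a slice.
        if name.lower() != "range"
    }
    return urllib.request.Request(pdf_url, headers=headers)


def download_pdf(pdf_url, copied_headers, out_path):
    """Fetch the PDF with the replayed headers and write it to out_path."""
    request = build_pdf_request(pdf_url, copied_headers)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as fh:
        fh.write(response.read())
```

Usage would look something like download_pdf(pdf_url, {"s": "<value from browser>", "ex": "<value from browser>"}, "document.pdf"), where pdf_url is the address shown in the Network tab. The header values probably expire, so they may need to be copied again for later downloads.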

Upvotes: 10

Tomáš Linhart

Reputation: 10210

In your browser's developer tools, under the Network/XHR tab, you can see the request for the actual document. In this particular case it is at the URL http://docplayer.net/storage/75/72489212/72489212.pdf. You can then look at the page source to see whether this URL can be inferred from it. The XPath //iframe[@id="player_frame"]/@src looks helpful. I haven't checked other pages, but something like this might work (as part of your parse method):

...
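# Storage URL pattern inferred from the example document above
url_template = 'http://docplayer.net/storage/{0}/{1}/{1}.pdf'
# Pull the two path segments that follow /docview/ in the iframe src
ids = response.xpath('//iframe[@id="player_frame"]/@src').re(r'/docview/([^/]+)/([^/]+)/')
if len(ids) >= 2:  # skip pages where the iframe or the pattern is missing
    file_url = url_template.format(*ids)
    yield scrapy.Request(file_url, callback=self.parse_pdf)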
...
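The URL derivation can be tried outside Scrapy as well. This is a small standalone sketch of the same idea; the helper name is hypothetical, and the iframe src format is an assumption read off the regular expression above, not something verified across pages.

```python
import re

# Same template and pattern as in the snippet above.
URL_TEMPLATE = "http://docplayer.net/storage/{0}/{1}/{1}.pdf"
DOCVIEW_RE = re.compile(r"/docview/([^/]+)/([^/]+)/")


def pdf_url_from_iframe_src(src):
    """Return the guessed PDF URL for an iframe src, or None if it does not match."""
    match = DOCVIEW_RE.search(src)
    if match is None:
        return None
    return URL_TEMPLATE.format(*match.groups())


# Assumed src shape; check it against the real page source.
print(pdf_url_from_iframe_src("//docplayer.net/docview/75/72489212/"))
# prints http://docplayer.net/storage/75/72489212/72489212.pdf
```

Returning None for a non-matching src lets the caller skip pages that use a different layout instead of failing on them.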

Upvotes: 0
