Sushil

Reputation: 501

How to download pdf file in ruby without .pdf in the link

I need to download a PDF with Ruby from a website that does not provide a link ending in .pdf. When I click the link manually, it takes me to a new page, and the dialog box to save/open the file appears after some time.

Please help me download the file.

The link

Upvotes: 0

Views: 5391

Answers (2)

roxxypoxxy

Reputation: 3121

You can do this:

require 'open-uri'

# Kernel#open no longer opens URLs as of Ruby 3.0; call URI.open instead.
File.open('my_file_name.pdf', 'wb') do |file|
  file.write URI.open('http://someurl.com/2013-1-2/somefile/download').read
end

I have been doing this for my projects and it works.
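Since the URL has no .pdf extension, it can be worth checking that what came back really is a PDF and not an HTML page. Here is a minimal sketch of that idea, assuming the server returns the file directly. The helper names pdf_data? and download_pdf are made up for illustration; the only fact relied on is that PDF files start with the bytes %PDF-.

```ruby
require 'open-uri'

# Every PDF file begins with this signature.
PDF_MAGIC = '%PDF-'.b

# True when the given bytes start with the PDF signature.
def pdf_data?(bytes)
  bytes.to_s.b.start_with?(PDF_MAGIC)
end

# Hypothetical helper: fetch `url` and write it to `path`,
# but only if the response body looks like a PDF.
def download_pdf(url, path)
  data = URI.open(url, &:read)
  raise "#{url} did not return a PDF" unless pdf_data?(data)

  File.binwrite(path, data)
end
```

You would call it as download_pdf('http://someurl.com/2013-1-2/somefile/download', 'my_file_name.pdf'). If the server answers with an HTML page instead of the file, this raises rather than saving a broken .pdf.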

Upvotes: 6

Peter Klipfel

Reputation: 5178

If you just need a simple Ruby script to do it, I'd just shell out to wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'

At that point though, you might as well run wget directly.

The other way is to just make a GET request to the URL where you know the PDF is:

require 'net/http'
source = Net::HTTP.get("the.website.com", "/and/some/params")  # host without the scheme, then the path

There are a number of other HTTP clients that you could use, but as long as you make a GET request to the endpoint that serves the PDF, it should give you the raw data. Then you can just save that data under a .pdf name (or rename the downloaded file), and you'll have the PDF.
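A rough sketch of that GET-and-save approach with Net::HTTP is below. It assumes the endpoint may answer with a redirect before serving the file (the question mentions being taken to a new page), which I have not verified for this site. The helper names redirect_target, fetch_body and save_pdf are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Resolve a Location header (absolute or relative) against the current URL.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

# GET `url`, following up to `limit` redirects, and return the raw body.
def fetch_body(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch_body(redirect_target(url, response['location']), limit - 1)
  else
    raise "GET #{url} failed: #{response.code} #{response.message}"
  end
end

# Write the downloaded bytes in binary mode so the PDF is not corrupted.
def save_pdf(url, path)
  File.binwrite(path, fetch_body(url))
end
```

Usage would be something like save_pdf('http://the.website.com/and/some/params', 'thefile.pdf'). Writing with File.binwrite matters on platforms that translate line endings in text mode.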

In your case, I ran the following commands to get the PDF:

wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf

Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I mentioned previously.

Update:

There is an added complication that was not initially stated: the URL of the PDF changes every time the PDF is updated. To make this work, you probably want to do some web scraping. I suggest Nokogiri. That way you can look at the page where the download link is and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured, and breaks Chrome within a few seconds of opening the page.

How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X where there would otherwise be a refresh button). Then I right-clicked next to the download link and selected Inspect Element, and browsed the DOM to find something definitively identifying (like an id). Thankfully, I found one: <strong id="telecharger"> Download</strong>. This means that you can use something like page.css('strong#telecharger')[0].parent['href'], which should give you a URL. Then you can perform a GET request as described above. I don't have time to make the script for you (too much work to do), but this should be enough to solve the problem.

Upvotes: 0
