rebbailey

Reputation: 754

How to download each zip file from a URL and unpack using Rails

Right now I have a URL that shows a list of .zip files when opened in the browser. I am trying to use Rails to download the files and then open them using Zip::File from the rubyzip gem. Currently I am fetching the page with the typhoeus gem:

response = Typhoeus.get("http://url_with_zip_files.com")

But the response.response_body is an HTML doc inside a string. I am new to programming so a hint in the right direction using best practices would help a lot.

response.response_body => "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<html>\n <head>\n  <title>Index of /mainstream/posts</title>\n </head>\n <body>\n<h1>Index of /mainstream/posts</h1>\n<table><tr><th><a href=\"?C=N;O=D\">Name</a></th><th><a href=\"?C=M;O=A\">Last modified</a></th><th><a href=\"?C=S;O=A\">Size</a></th><th><a href=\"?C=D;O=A\">Description</a></th></tr><tr><th colspan=\"4\"><hr></th></tr>\n<tr><td><a href=\"/5Rh5AMTrc4Pv/mainstream/\">Parent Directory</a></td><td>&nbsp;</td><td align=\"right\">  - </td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536091739.zip\">1476536091739.zip</a></td><td align=\"right\">15-Oct-2016 16:01  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536487496.zip\">1476536487496.zip</a></td><td align=\"right\">15-Oct-2016 16:04  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>"

Upvotes: 2

Views: 2201

Answers (2)

Stefan Lyew

Reputation: 417

To break this down you need to:

  1. Get the initial HTML index page with Typhoeus

      base_url = "http://url_with_zip_files.com/"
      response = Typhoeus.get(base_url)
    
  2. Then use Nokogiri to parse that HTML and extract all the links to the zip files (see: extract links (URLs), with nokogiri in ruby, from a href html tags?)

    # Nokogiri needs the HTML string, not the Typhoeus response object
    doc = Nokogiri::HTML(response.body)
    hrefs = doc.css('a').map { |link| link['href'] }

    # Keep only the .zip links. This also drops the column-sort links ("?C=N;O=D")
    # and the link to Parent Directory, which looks like: "/5Rh5AMTrc4Pv/mainstream/"
    zip_hrefs = hrefs.select { |href| href.to_s.end_with?('.zip') }

    # base_url already ends with "/", so no extra separator is needed
    links = zip_hrefs.map { |href| base_url + href }

    # Should look like:
    # links = ["http://url_with_zip_files.com/1476536091739.zip", "http://url_with_zip_files.com/1476536487496.zip" ...]
    
  3. Once you have all the links, visit each one to download the zip file with Ruby and unzip it (see: Ruby: Download zip file and extract)

     # Needs the rubyzip gem (require 'zip') and require 'stringio'.
     # The method must be defined before the loop that calls it.
     def download_and_parse(zip_file_link)
       input = Typhoeus.get(zip_file_link).body
       Zip::InputStream.open(StringIO.new(input)) do |io|
         while (entry = io.get_next_entry)
           puts entry.name
           # parse_zip_content is a placeholder for your own handling code
           parse_zip_content io.read
         end
       end
     end

     links.each do |link|
       download_and_parse(link)
     end
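
Joining hrefs to the base URL by string concatenation is easy to get subtly wrong (doubled or missing slashes, absolute paths such as the Parent Directory link). Below is a minimal sketch of an alternative using URI.join from the standard library. The helper name zip_urls and the example.com URL are placeholders for illustration, not part of the original page:

```ruby
require 'uri'

# Hypothetical helper: keep only hrefs ending in .zip and resolve each one
# against the URL of the index page.
def zip_urls(index_url, hrefs)
  hrefs.compact
       .select { |href| href.downcase.end_with?('.zip') }
       .map { |href| URI.join(index_url, href).to_s }
end

index_url = 'http://example.com/mainstream/posts/' # placeholder URL
hrefs = ['?C=N;O=D', '/5Rh5AMTrc4Pv/mainstream/',
         '1476536091739.zip', '1476536487496.zip', nil]

puts zip_urls(index_url, hrefs)
```

Filtering on the .zip suffix means the sort links and the Parent Directory link never need to be removed by position.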
    

If you want to use Typhoeus to stream the file contents from the URL, see the Typhoeus documentation section titled "Streaming the response body". You can also use Typhoeus to download all of the files in parallel, which should improve performance.

Upvotes: 2

Steven B

Reputation: 61

I believe Nokogiri will be your best bet.

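base_url = "http://url_with_zip_files.com/"
# Nokogiri needs the HTML string, so pass the response body
doc = Nokogiri::HTML(Typhoeus.get(base_url).body)
zip_array = []
doc.search('a').each do |link|
  href = link.attr("href").to_s
  # Only follow hrefs that end in .zip (skips the sort and Parent Directory links)
  if href.match?(/\.zip\z/i)
    # Each element is a Typhoeus::Response; the zip data is in its body
    zip_array << Typhoeus.get(base_url + href)
  end
end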

Upvotes: 1
