Reputation: 754
Right now I have a URL that shows a list of .zip files when opened in the browser. I am trying to use Rails to download the files and then open them using Zip::File from the rubyzip gem. Currently I am doing this using the typhoeus gem:
response = Typhoeus.get("http://url_with_zip_files.com")
But response.response_body is an HTML document inside a string. I am new to programming, so a hint in the right direction using best practices would help a lot.
response.response_body => "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<html>\n <head>\n <title>Index of /mainstream/posts</title>\n </head>\n <body>\n<h1>Index of /mainstream/posts</h1>\n<table><tr><th><a href=\"?C=N;O=D\">Name</a></th><th><a href=\"?C=M;O=A\">Last modified</a></th><th><a href=\"?C=S;O=A\">Size</a></th><th><a href=\"?C=D;O=A\">Description</a></th></tr><tr><th colspan=\"4\"><hr></th></tr>\n<tr><td><a href=\"/5Rh5AMTrc4Pv/mainstream/\">Parent Directory</a></td><td> </td><td align=\"right\"> - </td><td> </td></tr>\n<tr><td><a href=\"1476536091739.zip\">1476536091739.zip</a></td><td align=\"right\">15-Oct-2016 16:01 </td><td align=\"right\"> 10M</td><td> </td></tr>\n<tr><td><a href=\"1476536487496.zip\">1476536487496.zip</a></td><td align=\"right\">15-Oct-2016 16:04 </td><td align=\"right\"> 10M</td><td> </td></tr>"
Upvotes: 2
Views: 2201
Reputation: 417
To break this down you need to:
Get the initial HTML index page with Typhoeus
base_url = "http://url_with_zip_files.com/"
response = Typhoeus.get(base_url)
Then use Nokogiri to parse that HTML and extract the links to the zip files (see: extract links (URLs), with nokogiri in ruby, from a href html tags?)
doc = Nokogiri::HTML(response.body)
hrefs = doc.css('a').map { |link| link['href'] }
# The page also links to the column sort options ("?C=N;O=D" etc.) and to the
# Parent Directory ("/5Rh5AMTrc4Pv/mainstream/"), so keep only the .zip hrefs
zip_hrefs = hrefs.compact.select { |href| href.end_with?('.zip') }
# base_url already ends with "/", so plain concatenation gives a full URL
links = zip_hrefs.map { |href| base_url + href }
# Should look like:
# links = ["http://url_with_zip_files.com/1476536091739.zip", "http://url_with_zip_files.com/1476536487496.zip" ...]
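As a rough sketch of the filtering step using only the standard library: given hrefs that were already pulled out of the index page, keep the .zip entries and resolve each one against the directory URL with URI.join. The helper name zip_urls and the example.com base URL are assumptions made for illustration, not part of the original setup.

```ruby
require "uri"

# Hypothetical helper: keep only the .zip hrefs and resolve each one against
# the URL of the directory listing.
def zip_urls(base_url, hrefs)
  hrefs
    .compact                                            # drop anchors without an href
    .select { |href| href.downcase.end_with?(".zip") }  # skip sort links and Parent Directory
    .map { |href| URI.join(base_url, href).to_s }       # relative href -> absolute URL
end

# example.com stands in for the real host (an assumption for illustration).
base_url = "http://example.com/mainstream/posts/"
hrefs = ["?C=N;O=D", "/5Rh5AMTrc4Pv/mainstream/", "1476536091739.zip", "1476536487496.zip"]

puts zip_urls(base_url, hrefs)
```

Note that URI.join treats the base as a directory only when it ends with "/"; without the trailing slash, the last path segment of the base is replaced by the href.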
Once you have all the links, visit each one to download the zip file and read its entries (see: Ruby: Download zip file and extract). Define the helper before the loop that calls it:
def download_and_parse(zip_file_link)
  input = Typhoeus.get(zip_file_link).body
  Zip::InputStream.open(StringIO.new(input)) do |io|
    while (entry = io.get_next_entry)
      puts entry.name
      # parse_zip_content is a placeholder for your own handling of the entry data
      parse_zip_content(io.read)
    end
  end
end

links.each do |link|
  download_and_parse(link)
end
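Since the original problem was getting an HTML page back where a zip was expected, it may help to check a downloaded body before handing it to the zip reader. The sketch below looks only at the leading signature bytes and does not validate the archive; the helper name zip_payload? is an assumption made for illustration.

```ruby
# Hypothetical helper: true when the body starts with a ZIP signature.
# "PK\x03\x04" opens a regular archive, "PK\x05\x06" an empty one.
def zip_payload?(body)
  return false if body.nil?

  # Compare as raw bytes so the encoding of the response body does not matter.
  body.b.start_with?("PK\x03\x04", "PK\x05\x06")
end

puts zip_payload?("PK\x03\x04...")           # body that begins like a zip file
puts zip_payload?("<!DOCTYPE HTML PUBLIC>")  # body that is an HTML index page
```

A check like this could run on the downloaded body inside download_and_parse so that HTML error pages or directory listings are skipped instead of raising inside the zip reader.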
If you want to use Typhoeus to stream the file contents from the URL into memory, see the Typhoeus documentation section titled "Streaming the response body". You can also use Typhoeus to download all of the files in parallel (see Typhoeus::Hydra), which should improve performance.
Upvotes: 2
Reputation: 61
I believe Nokogiri will be your best bet.
base_url = "http://url_with_zip_files.com/"
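# Nokogiri needs the HTML string, so pass the response body rather than the response object
doc = Nokogiri::HTML(Typhoeus.get(base_url).body)
zip_array = []
doc.search('a').each do |link|
  href = link.attr("href")
  # Skip anchors without an href and anything that is not a .zip file
  next unless href && href.match?(/\.zip\z/i)
  zip_array << Typhoeus.get(base_url + href)
end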
Upvotes: 1