Reputation: 19
I'm writing a program to download the images from an imgur album. I had just begun to write the actual image-link code:
# The imports.
require 'open-uri'
require 'nokogiri'

url = ARGV[0]

# The title.
open(url) do |f|
  $doc = Nokogiri::HTML(f)
  title = $doc.at_css('title').text.strip.clone
  re = /\/[a]\/\w{5}/
  s2 = url.match re
  puts title
  puts s2
end

href = $doc.xpath("//img")
puts href
That's when I ran into a major problem: the page I download isn't the same as the page I see in the browser.
For example, this album (https://i.sstatic.net/3swCc.jpg) has the following code for its images:
<span class="post-grid-image pointer" data-href="//i.imgur.com/zh6I7k2.png" data-title="" style="transform: translate(0px, 0px) scale(1); z-index: 0; background-image: url("//i.imgur.com/zh6I7k2b.jpg");"></span>
And yet when I look in the page source, or point my code at the span elements, all the images are missing:
<div class="post-images is-owner">
<div class="post-action nodisplay"></div>
</div>
</div>
The HTML is dynamic and changes based on what my browser does. There aren't any images in the page source, and everything is loaded by some JavaScript system. How can I scrape these elements when they aren't even in the page source?
And what's the difference between "inspect" and "view-source"? That's what started this whole problem.
Upvotes: 1
Views: 75
Reputation: 160601
It's dynamic HTML. Mechanize and/or Nokogiri can't help you unless you can build the final version of the page yourself and then pass it to them.
Instead you have to use something that can interpret JavaScript and apply CSS, such as a browser. The Watir project would be the first thing to investigate; search SO for [ruby] [watir].
As for your last question: "inspect" reflects the page after the browser has processed the JavaScript and CSS in it, which often has little bearing on what the page looked like before that. "view-source" shows the raw HTML the server sent, which is also what Nokogiri sees.
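For example, here's a rough sketch of that approach using Watir. It assumes the watir gem plus a browser driver (e.g. chromedriver) are installed, and that the album still renders the post-grid-image spans shown in your question; treat it as a starting point rather than a drop-in solution:
require 'watir'

url = ARGV[0]

browser = Watir::Browser.new :chrome   # needs chromedriver on your PATH
browser.goto url

# Wait until the JavaScript has rendered at least one image placeholder.
browser.span(class: 'post-grid-image').wait_until(&:present?)

# Collect the data-href attribute from every rendered image span.
image_urls = browser.spans(class: 'post-grid-image').map { |span| span.attribute_value('data-href') }

puts image_urls
browser.close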
Use wget, curl, or Nokogiri to retrieve the page so you can see the raw HTML.
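For instance, here's a quick way to dump exactly what the server returns so you can compare it against what "inspect" shows; the raw_album.html filename and the album URL are only placeholders:
require 'open-uri'

# Fetch the album page exactly as the server sends it, before any JavaScript runs.
raw_html = URI.parse('http://imgur.com/a/tGRvr').open.read
File.write('raw_album.html', raw_html)

# Most likely prints nothing for the album images, matching what view-source shows.
puts raw_html.scan(/<img[^>]*>/)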
$doc.at_css('title') should be replaced with the title method: doc.title.
Don't use a global like $doc. Learn about variable scoping, then decide if a global is the right way to go.
Instead of open with a block:
open(url) do |f|
  $doc = Nokogiri::HTML(f)
  title = $doc.at_css('title').text.strip.clone
  re = /\/[a]\/\w{5}/
  s2 = url.match re
  puts title
  puts s2
end
Do this instead:
doc = Nokogiri::HTML(open(url))
title = doc.title
When working with URIs/URLs, use the built-in URI class since it's a well-debugged tool:
require 'uri'
url = URI.parse('http://imgur.com/a/tGRvr/layout/grid')
url.path            # => "/a/tGRvr/layout/grid"
url.path.split('/') # => ["", "a", "tGRvr", "layout", "grid"]
Knowing that, you can do:
url.path.split('/')[2] # => "tGRvr"
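Once you have the actual image URLs (from Watir or wherever), saving them is straightforward with open-uri. This is a minimal sketch assuming imgur's protocol-relative "//i.imgur.com/..." form from your question; the download_image helper is just an illustration:
require 'open-uri'
require 'uri'

# Save one image next to the script. imgur's data-href values are
# protocol-relative ("//i.imgur.com/xxxx.png"), so prepend a scheme first.
def download_image(href)
  full = href.start_with?('//') ? "https:#{href}" : href
  File.binwrite(File.basename(full), URI.parse(full).open.read)
end

download_image('//i.imgur.com/zh6I7k2.png')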
Upvotes: 1