Jason Ayes

Reputation: 19

How to scrape in Ruby when the page elements keep changing and shifting.

I'm writing a program to download the images from an imgur album. I had just begun writing the actual image-link extraction code:

#The imports.
require 'open-uri'
require 'nokogiri'

url = ARGV[0]

#The title.
open(url) do |f|
  $doc = Nokogiri::HTML(f)
  title = $doc.at_css('title').text.strip.clone
  re = /\/[a]\/\w{5}/
  s2 = url.match re
  puts title
  puts s2
end



href = $doc.xpath("//img")
puts href

That's when I ran into a major problem: the HTML I download isn't the same as what I see in the browser.

For example, this album: https://i.sstatic.net/3swCc.jpg has the following code for its images:

<span class="post-grid-image pointer" data-href="//i.imgur.com/zh6I7k2.png" data-title="" style="transform: translate(0px, 0px) scale(1); z-index: 0; background-image: url(&quot;//i.imgur.com/zh6I7k2b.jpg&quot;);"></span>

And yet when I look in the page source, or run the code for span elements, all the images are missing:

            <div class="post-images is-owner">

            <div class="post-action nodisplay"></div>

            </div>
        </div>

The HTML is dynamic, and changes depending on my browser. There aren't any images in the page source; everything is loaded by some JavaScript system. How can I scrape dynamic elements when they don't even appear in the source?

And what's the difference between inspect and 'view-source'? That's what started this whole problem.

Upvotes: 1

Views: 75

Answers (1)

the Tin Man

Reputation: 160601

It's dynamic HTML. Mechanize and/or Nokogiri can't help you unless you can build the final version of the page and then pass it to them.

Instead you have to use something that can interpret JavaScript and apply CSS, such as a browser. The WATIR project would be the first thing to investigate. As for your last question: "inspect" shows the live DOM after the browser has run the page's JavaScript and applied its CSS, while "view-source" shows the raw HTML exactly as it was delivered, which often has little bearing on what the final page looks like. Search SO for [ruby] [watir].
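A minimal sketch of that approach with Watir (this assumes the watir gem and a matching browser driver are installed, and the CSS selector is a guess based on the markup shown in the question):

```ruby
require 'watir'
require 'nokogiri'

# Drive a real browser so imgur's JavaScript can build the final page.
browser = Watir::Browser.new :chrome
browser.goto('http://imgur.com/a/tGRvr/layout/grid')

# browser.html is the DOM *after* JavaScript has run; hand that to Nokogiri.
doc = Nokogiri::HTML(browser.html)

# Assumption: the image URLs live in data-href attributes on the grid spans,
# as in the snippet shown in the question.
doc.css('span.post-grid-image').each do |span|
  puts "http:#{span['data-href']}"
end

browser.close
```

Watir waits for the page to load before returning from goto, which is what makes this work where open-uri can't.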

Use wget, curl, or Nokogiri to retrieve the page when you want to see the raw HTML.

$doc.at_css('title').text should be replaced by the title method: doc.title.

Don't use a global like $doc. Learn about variable scoping then decide if a global is the right way to go.
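For contrast, a minimal sketch of the difference in scope (the names here are illustrative, not from your code):

```ruby
# A plain local variable is confined to the method (or block) that creates it.
def fetch_title
  title = "My Album" # local to fetch_title; invisible anywhere else
  title
end

# A global like $doc is visible from every scope in the program,
# which makes it hard to reason about who changed it and when.
$title = "leaks everywhere"

puts fetch_title # => My Album
puts $title      # => leaks everywhere
```

Returning values from methods, as fetch_title does, keeps the data flow explicit and avoids the global entirely.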

Instead of open with a block:

open(url) do |f|
  $doc = Nokogiri::HTML(f)
  title = $doc.at_css('title').text.strip.clone
  re = /\/[a]\/\w{5}/
  s2 = url.match re
  puts title
  puts s2
end

Do this instead:

doc = Nokogiri::HTML(open(url))
title = doc.title

When working with URIs/URLs, use the built-in URI class since it's a well debugged tool:

require 'uri'

url = URI.parse('http://imgur.com/a/tGRvr/layout/grid')

url.path # => "/a/tGRvr/layout/grid"
  .split('/') # => ["", "a", "tGRvr", "layout", "grid"]

Knowing that, you can do:

url.path.split('/')[2] # => "tGRvr"
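Putting that together, a small helper (the name album_id is hypothetical) that pulls the album ID out of an imgur album URL:

```ruby
require 'uri'

# Hypothetical helper: extract the album ID segment from an imgur album URL.
# URI#path splits cleanly on "/", so no hand-rolled regex is needed.
def album_id(album_url)
  URI.parse(album_url).path.split('/')[2]
end

puts album_id('http://imgur.com/a/tGRvr/layout/grid') # => tGRvr
```

Unlike the /\/[a]\/\w{5}/ regex in the question, this doesn't assume the ID is exactly five characters long.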

Upvotes: 1
