ZK Zhao

Reputation: 21513

Unable to use Nokogiri to scrape a page

I'm trying to use Nokogiri to scrape this page: http://www.tudou.com/home/_48712163/item

The goal is to get the video info on this page (title, href, etc.).

The HTML is:

<div class="pack pack_album2" data-stat-role="ck" data-stat-href="http://www.tudou.com/programs/view/e3jLsLPGct0/"></div>

I targeted the .pack class:

require 'nokogiri'
require 'open-uri'

url = 'http://www.tudou.com/home/_48712163/item'
doc = Nokogiri::HTML(URI.open(url))
puts doc.css("title").text
doc.css(".pack").each do |item|
  # get video info
  title = item.css(".txt a")[0]['title']
  # at returns a single node, so it must not be indexed with [0]
  href = item.at(".txt a")['href']
  puts title
  puts href
end

However, the result says that .pack is a nil class.

I also tried puts doc.css(".page-container").to_s, since .page-container is the parent div of .pack. The output shows that there is no .pack inside it.

How can I get the content of .pack?

Upvotes: 0

Views: 369

Answers (2)

1Rhino

Reputation: 298

The website loads content using Ajax.

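You can see the Ajax call and how the HTML content is handled in http://js.tudouui.com/v3/dist/js/page/home/v2/main_33.js

Search that file for "pack pack_album2". Nokogiri only parses the HTML it is given and does not execute JavaScript, so there is no way to get the Ajax content using Nokogiri alone.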

Upvotes: 0

Duck1337

Reputation: 524

You need to run the page's JavaScript. If you're comfortable with JavaScript, I suggest using PhantomJS. If Ruby is easier for you, you can use Watir:

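require 'watir-webdriver'   # on Watir 6.0 and later: require 'watir'
require 'nokogiri'

browser = Watir::Browser.start "http://www.tudou.com/home/_48712163/item"

# browser.html is the page source after the JavaScript has run
page_html = Nokogiri::HTML.parse(browser.html)

# ".pack" is the selector from the question; adjust it to the elements you need
video_info = page_html.css(".pack")

browser.close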

You can also run this without a visible browser window by using the headless gem. It relies on Xvfb, so it is mainly an option on Linux.

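require 'watir-webdriver'   # on Watir 6.0 and later: require 'watir'
require 'nokogiri'
require 'headless'

headless = Headless.new
headless.start              # start a virtual X display (Xvfb)

browser = Watir::Browser.start "http://www.tudou.com/home/_48712163/item"

page_html = Nokogiri::HTML.parse(browser.html)

# ".pack" is the selector from the question; adjust it to the elements you need
video_info = page_html.css(".pack")

browser.close
headless.destroy            # shut down the virtual display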

Upvotes: 1
