Reputation: 21513
I'm trying to use Nokogiri to scrape this page: http://www.tudou.com/home/_48712163/item
The goal is to get the video info on this page (title, href, etc.).
The HTML is:
<div class="pack pack_album2" data-stat-role="ck" data-stat-href="http://www.tudou.com/programs/view/e3jLsLPGct0/"></div>
I targeted the .pack class:
require 'nokogiri'
require 'open-uri'

url = 'http://www.tudou.com/home/_48712163/item'
doc = Nokogiri::HTML(URI.open(url))
puts doc.css("title").text
doc.css(".pack").each do |item|
  # get video info
  title = item.css(".txt a")[0]['title']
  # at returns a single node, so it must not be indexed with [0]
  href = item.at(".txt a")['href']
  puts title
  puts href
end
However, the result says that .pack is nil (NilClass).
In fact, I tried puts doc.css(".page-container").to_s, where .page-container is the parent div of .pack. The output shows that there is no .pack inside it.
How can I get the content of .pack?
Upvotes: 0
Views: 369
Reputation: 298
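The website loads that content with Ajax, so it is not in the HTML that Nokogiri receives.
You can see the Ajax call and the code that builds the HTML in http://js.tudouui.com/v3/dist/js/page/home/v2/main_33.js
Search that file for "pack pack_album2". Nokogiri only parses the markup it is given and does not run JavaScript, so it cannot fetch Ajax content on its own.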
Upvotes: 0
Reputation: 524
The page builds its content with JavaScript, so you need something that executes it, such as a real browser. If you're comfortable with JavaScript, I suggest PhantomJS. If Ruby is easier for you, you can use Watir:
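require 'watir-webdriver'
require 'nokogiri'

# Open the page in a real browser so its JavaScript runs
browser = Watir::Browser.start "http://www.tudou.com/home/_48712163/item"
# Parse the rendered HTML rather than the raw server response
page_html = Nokogiri::HTML.parse(browser.html)
# "#xpath" is a placeholder; substitute your own CSS selector
video_info = page_html.css("#xpath")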
You can also run this without a visible browser window by using the headless gem. It wraps Xvfb, so this applies to Linux:
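require 'watir-webdriver'
require 'nokogiri'
require 'headless'

# Start a virtual display (Xvfb) for the browser to render into
headless = Headless.new
headless.start
browser = Watir::Browser.start "http://www.tudou.com/home/_48712163/item"
page_html = Nokogiri::HTML.parse(browser.html)
# "#xpath" is a placeholder; substitute your own CSS selector
video_info = page_html.css("#xpath")
# Clean up the browser and the virtual display
browser.close
headless.destroy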
Upvotes: 1