Reputation: 651
I need to parse out the image URL from HTML much like the following:
<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>
So far I am using Nokogiri to parse out <h2> tags with:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')
puts headers.text
I have two questions:
1. Header 1
   image_url 1
   image_url 2 (if any)
2. Header 2
   2image_url 1
   2image_url 2 (if any)
And so far I haven't been able to print my headers in this nice format. How can I do so?
<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
<p class="post_author"><em>by</em> author</p>
<div class="format_text">
<p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3" target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2" target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3" target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8" target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
</div>
<p class="to_comments"><span class="date">February 15, 2013</span> <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>
Upvotes: 3
Views: 7716
Reputation: 11
I did something similar once (I wanted exactly the same output, actually), and the solution is pretty easy to follow.
Depending on how the DOM is structured, you could do something like:
bodies = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')
bodies.each_with_index do |body, index|
  header = headers[index]
  # header is a Nokogiri element, so call .text before building the string
  puts "#{index + 1}. #{header.text}"
  # print only absolute image URLs; to_s guards against images without a src
  body.css('p a img, div > img').each { |img| puts img['src'] if img['src'].to_s.match(/\Ahttp/) }
end
So basically, you walk through the post bodies and print each post's header followed by any images found in that body. The page I was parsing had the headers outside the divs that contained the images, which is why I used two different variables to find them (body / headers) and paired them by index. I also used two selectors when looking for images (p a img and div > img), as this is the way this particular DOM was structured.
This should give you a nice clean output like you wanted.
Hope this helps!
Upvotes: 0
Reputation: 651
Code that I ended up using. Feel free to critique (I'll probably learn from it):
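require 'rubygems'
require 'nokogiri'
require 'open-uri' # needed to open a URL; Kernel#open stopped accepting URLs in Ruby 3.0
doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))
doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i+1
  puts " Title: #{header.text}"
  # following:: is not limited to the current post, so a post with fewer than
  # two images will pick up images that belong to later posts.
  [1, 2].each do |n|
    img = header.at_xpath("following::img[#{n}]")
    # at_xpath returns nil when no such image exists, so guard before reading src
    puts " Image #{n}: #{img['src']}" if img
  end
end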
Upvotes: 0
Reputation: 37517
To get images, simply look for the img tags with a src attribute.
If you want the h2 associated with each image, you can do this:
doc.xpath('//img').each do |img|
puts "Header: #{img.xpath('preceding::h2[1]').text}"
puts " Image: #{img['src']}"
end
Note that a switch to XPath was in order for the preceding:: axis.
EDIT
To group by header, you can put them in a hash:
headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
header = img.xpath('preceding::h2[1]').text
image = img['src']
headers[header] << image
end
To get the output you've prescribed:
headers.each do |h,urls|
puts "#{h} #{urls.join(' ')}"
end
Upvotes: 5
Reputation: 54984
I think it makes more sense to group by h2 first:
doc.search('h2').each_with_index do |h2, i|
puts "#{i+1}."
puts h2.text
h2.search('+ p + div > p[3] img').each do |img|
puts img['src']
end
end
Upvotes: 6