Reputation:
I would like to search a web page and, if the search matches, read an attribute from the result. Here is the web page: link text
I want to know whether the head contains a meta tag whose property attribute has the value "og:title"; if it does, I want its content value.
If we look at the source of the page, it contains this portion:
<meta
property="og:title" content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" />
So I want a true result for the og:title query, and the value Explore the Titanic Wreck Site via Social Media [EXCLUSIVE] from the next search. How do I do this properly?
search("/html/head/meta[(@property='og:title']")
doesn't return what I want.
Any suggestions?
Upvotes: 0
Views: 1178
Reputation:
Thanks for the answers. When I posted my question I didn't realize I had a mistake in the search. It was Friday evening...
The correct search is
elements = @doc.search("/html/head/meta[@property='og:title']")
with the stray ( character before @property removed. This gives:
elements = <meta property="og:title" content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" />
Then I check whether anything was found and, if so, read the content value:
if elements.nil?
  puts 'not found'
elsif elements.size > 0
  puts "Found one, og:title = #{elements}"
  # attr reads the attribute from the first matched element
  content = elements.attr("content")
  puts content # displays the content (it will be processed further)
else
  # Can control flow reach this branch? Theoretically yes, but in practice?
end
Upvotes: 1
Reputation: 160551
Your XPath has a syntax error in it (an unmatched opening parenthesis), and it is also more restrictive than it needs to be:
search("/html/head/meta[(@property='og:title']")
should be:
search("/html/head/meta[@property='og:title']")
to fix the error. I'd simplify it to:
search("//meta[@property='og:title']")
Also, it's not quite clear what you want to do. Do you want to find
<meta
property="og:title"
content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]"
/>
and extract the content attribute? Or do you want to locate the tag, confirm it contains both the "og:title" property and the "Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" content, and then do further processing?
That said, often it's simpler to use CSS accessors instead of XPath. I prefer using Nokogiri, which has both XPath and CSS selectors; I'm using CSS below:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://mashable.com/2010/08/06/expedition-titanic')) # URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs
(doc % 'meta[property="og:title"]')
=> #<Nokogiri::XML::Element:0x8084ee48 name="meta" attributes=[#<Nokogiri::XML::Attr:0x8084ed58 name="property" value="og:title">, #<Nokogiri::XML::Attr:0x8084ed1c name="content" value="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]">]>
Nokogiri and Hpricot support the / and % shorthand for search and at respectively. "Search" returns an array of all matches, and "at" returns only the first match. So, the example above gets the first node using the CSS, showing this is the right track. I'm not sure how to use CSS to match two parameters in the same tag, so I'll go after all <meta> tags with property="og:title", then filter based on the content= parameter:
(doc / 'meta[property="og:title"]').select{ |n| n['content'][/titanic wreck site/i] }
=> [#<Nokogiri::XML::Element:0x8084ee48 name="meta" attributes=[#<Nokogiri::XML::Attr:0x8084ed58 name="property" value="og:title">, #<Nokogiri::XML::Attr:0x8084ed1c name="content" value="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]">]>]
At that point we've got the right node in the returned array, so you can extract whatever you want, or dive into its children and sack and pillage. To do that you'll want to use .first or [0] to get at the actual node for further processing:
(doc / 'meta[property="og:title"]').select{ |n| n['content'][/titanic wreck site/i] }.first
Update based on OP's response, using Nokogiri still:
>> meta = (doc % 'meta[property="og:title"]')['content']
>> meta #=> "Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]"
Upvotes: 1