Reputation: 3036
I want to extract a specific link from a webpage, searching for it by its text, using Nokogiri:
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
For example, given the link text "site 3", I would like to get back its href:
http://example.org/site/3/
Or, given the link text "site 1":
http://example.org/site/1/
How can I do it?
Upvotes: 3
Views: 2741
Reputation: 160631
Just to document another way we can do this in Ruby, using the URI module:
require 'uri'
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]
uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]
=> {
1 => "http://example.org/site/1/'",
2 => "http://example.org/site/2/'",
3 => "http://example.org/site/3/'"
}
uris[1]
=> "http://example.org/site/1/'"
uris[3]
=> "http://example.org/site/3/'"
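Under the covers, URI.extract uses a regular expression, which isn't the most robust way to find links in a page, but it works reasonably well because a useful URI is normally a string without whitespace. Note that this selects links by position rather than by link text, and that in the output above each match still carries the trailing single quote from the href attribute, which would need to be stripped.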
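If the lookup has to be by link text, as the question asks, the same regular-expression idea can be stretched to capture both the href and the anchor text. The following is only a minimal sketch: the `link_pattern` and `links_by_text` names are made up for illustration, and the pattern assumes markup as simple as the sample (href is the only attribute, and the anchor holds plain text). For real-world HTML a parser such as Nokogiri is the safer choice.

```ruby
html = <<~HTML
  <div class="links">
  <a href='http://example.org/site/1/'>site 1</a>
  <a href='http://example.org/site/2/'>site 2</a>
  <a href='http://example.org/site/3/'>site 3</a>
  </div>
HTML

# Hypothetical pattern (an assumption, not part of any library):
#   group 1 = the opening quote character, reused as \1 to find the closing quote
#   group 2 = the href value, without the surrounding quotes
#   group 3 = the anchor text
link_pattern = /<a\s+href=(['"])(.*?)\1\s*>(.*?)<\/a>/m

# Build a hash that maps link text => href.
links_by_text = html.scan(link_pattern).to_h do |_quote, href, label|
  [label.strip, href]
end

puts links_by_text["site 3"]  # prints http://example.org/site/3/
puts links_by_text["site 1"]  # prints http://example.org/site/1/
```

Because the quote characters sit outside the captured group, the hrefs come back without the trailing single quote.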
Upvotes: 1
Reputation: 55012
You may prefer CSS-style selectors:
doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere
Upvotes: 3
Reputation: 14412
Original:
require 'nokogiri'

text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT
link_text = "site 1"
doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s
Updated:
As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic starts-with matching, there's a function called starts-with that you can use like this (links starting with "s"):
doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)
Upvotes: 3
Reputation: 4944
require 'nokogiri'
text = "site 1"
# DATA is only defined when the script ends with an __END__ marker followed by the HTML
doc = Nokogiri::HTML(DATA)
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s
Upvotes: 1