Fran b

Reputation: 3036

Extract a link with Nokogiri by the link's text?

I want to extract a specific link from a webpage, searching for it by its text, using Nokogiri:

<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>

I would like to look up the href of "site 3" and get back:

http://example.org/site/3/

Or look up the href of "site 1" and get back:

http://example.org/site/1/

How can I do it?

Upvotes: 3

Views: 2741

Answers (4)

the Tin Man

Reputation: 160631

Just to document another way we can do this in Ruby, using the URI module:

require 'uri'

html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]

=> {
    1 => "http://example.org/site/1/'",
    2 => "http://example.org/site/2/'",
    3 => "http://example.org/site/3/'"
}

uris[1]
=> "http://example.org/site/1/'"

uris[3]
=> "http://example.org/site/3/'"

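Under the covers, URI.extract uses a regular expression, which isn't the most robust way of finding links in a page, but it works reasonably well because a useful URI is usually a string without whitespace. Note that in the output above the regular expression also captured the closing quote of each href attribute, so the trailing ' has to be stripped before the URLs are used. This approach also indexes links by position rather than by their text.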
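As a rough sketch of how the captured quote could be cleaned up, the snippet below restricts the extraction to http/https and drops a trailing ' from each result. The helper name extract_clean_uris and the constant SAMPLE_HTML are made up for illustration, and current Ruby documentation marks URI.extract as obsolete, so treat this as a quick illustration rather than a recommended approach:

```ruby
require 'uri'

# Sample markup from the question (constant name is an assumption).
SAMPLE_HTML = <<~HTML
  <div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
  </div>
HTML

# Hypothetical helper: extract http(s) URIs and remove a trailing single
# quote that the regex may have picked up from the attribute delimiter.
def extract_clean_uris(html)
  URI.extract(html, %w[http https]).map { |u| u.delete_suffix("'") }
end

puts extract_clean_uris(SAMPLE_HTML)
```

The result is still ordered by position in the page, so it does not answer "find the link whose text is site 3"; for that, a real HTML parser such as Nokogiri is the better fit.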

Upvotes: 1

pguardiario

Reputation: 55012

You might prefer CSS-style selection (doc is the parsed Nokogiri document):

doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere

Upvotes: 3

Jiří Pospíšil

Reputation: 14412

Original:

require 'nokogiri'

text = <<TEXT
<div class="links">
  <a href='http://example.org/site/1/'>site 1</a>
  <a href='http://example.org/site/2/'>site 2</a>
  <a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT

link_text = "site 1"

doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s

Updated:

As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic prefix matching there is a function called starts-with, which you can use like this (links whose text starts with "s"):

doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)

Upvotes: 3

Eugene Rourke

Reputation: 4944

require 'nokogiri'

text = "site 1"

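doc = Nokogiri::HTML(DATA) # DATA reads everything after __END__ in this script
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s
__END__
<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>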

Upvotes: 1
