sunnyrjuneja

Reputation: 6123

Get link and href text from html doc with Nokogiri & Ruby?

I'm trying to use the Nokogiri gem to extract all the URLs on a page, along with their link text, and store each link text and URL pair in a hash.

<html>
    <body>
        <a href=#foo>Foo</a>
        <a href=#bar>Bar </a>
    </body>
</html>

I would like to return

{"Foo" => "#foo", "Bar" => "#bar"}

Upvotes: 6

Views: 11976

Answers (2)

Mark Thomas

Reputation: 37527

Here's a one-liner:

Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Split up a bit to be arguably more readable:

h = {}
doc.xpath('//a[@href]').each do |link|
  h[link.text.strip] = link['href']
end
puts h

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Upvotes: 14

mu is too short

Reputation: 434985

Another way:

h = doc.css('a[href]').each_with_object({}) { |n, h| h[n.text.strip] = n['href'] }
# yields {"Foo"=>"#foo", "Bar"=>"#bar"}

And if you're worried that the same text might link to different things, you can collect the hrefs in arrays:

h = doc.css('a[href]').each_with_object(Hash.new { |h,k| h[k] = [ ]}) { |n, h| h[n.text.strip] << n['href'] }
# yields {"Foo"=>["#foo"], "Bar"=>["#bar"]}
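To see why the array version matters, here is a small sketch that leaves Nokogiri out and uses made-up `[text, href]` pairs, the same shape the loop would produce on a page where "Foo" appears twice. The `pairs` data and the `#foo2` target are hypothetical, not taken from the question:

```ruby
# Hypothetical [text, href] pairs standing in for what the Nokogiri loop would
# yield on a page where "Foo" links to two different targets.
pairs = [['Foo', '#foo'], ['Bar', '#bar'], ['Foo', '#foo2']]

# Plain hash: the later "Foo" silently overwrites the earlier one.
flat = pairs.each_with_object({}) { |(text, href), acc| acc[text] = href }

# Hash with a default block: every href is kept, grouped under its text.
grouped = pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(text, href), acc|
  acc[text] << href
end

p flat
p grouped
```

With the plain hash only the last href for "Foo" survives. The grouped version keeps both, in document order.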

Upvotes: 2
