gal

Reputation: 302

How to get the href attribute value of an <a> tag with Ruby

My goal is to find the first result in the Google search results and collect the site's link, so I wrote this script:

require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url

I get a string like this:

url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>

But I need only the link (http://en.wikipedia.org/wiki/Gallon), not the whole HTML element. How can I do that? I am using these gems:

require 'hpricot'
require 'open-uri'
require 'mechanize'

Upvotes: 0

Views: 6932

Answers (6)

mikej

Reputation: 66293

Instead of converting to a string with url = site.to_s, read the attribute from the first element of the result: url = site[0].attributes['href']
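The site[0] step is needed because [16,1] asks for a slice (start index 16, length 1) rather than a single element. A minimal sketch of the difference, assuming the search result behaves like a plain Ruby Array; the strings below are placeholders, not real parsed elements:

```ruby
# Placeholder list standing in for the anchors returned by doc.search("a").
# Real results would be parsed elements; strings are an assumption here.
links = (0..20).map { |i| "link#{i}" }

slice  = links[16, 1]  # start 16, length 1: a one-element collection
single = links[16]     # just the element at index 16

puts slice.inspect
puts single.inspect

# The slice has to be unwrapped before its element can be used.
puts slice[0] == single
```

Under that assumption, indexing with [16] in the first place would avoid the extra [0].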

Upvotes: 1

Jakub Hampl

Reputation: 40553

Since Mechanize includes Nokogiri, you can skip Hpricot altogether. It will slow your code down unnecessarily, because you are effectively parsing the page twice.

require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)

puts search_results.links[16].href

Upvotes: 6

Jonas Elfström

Reputation: 31438

You can get the value of an attribute like this:

(doc/"a")[16].attributes['href']

However, I have to say that the magic number 16 seems brittle.
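One way to avoid a fixed index is to pick the link by what it points to. This is only a sketch: Link is a hypothetical Struct standing in for whatever link objects the scraping library returns, and the sample entries and the "google." filter are assumptions made for illustration.

```ruby
# Hypothetical stand-in for a library's link objects; only href and text
# are modelled here.
Link = Struct.new(:href, :text)

# Sample data (assumed), mimicking navigation links followed by results.
links = [
  Link.new("/preferences", "Settings"),
  Link.new("http://www.google.co.il/imghp", "Images"),
  Link.new("http://en.wikipedia.org/wiki/Gallon", "Gallon - Wikipedia, the free encyclopedia"),
  Link.new("http://www.example.com/gallon", "Another result")
]

# Take the first absolute link that does not point back at the search site,
# instead of relying on its position in the page.
first_result = links.find do |link|
  link.href.start_with?("http") && !link.href.include?("google.")
end

puts first_result.href if first_result
```

A predicate like this still depends on the page's markup, but it does not break when a navigation link is added or removed ahead of the results.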

You are also not supposed to scrape the search results; you should consider using the Custom Search API instead.

Upvotes: 6

Fareesh Vijayarangam

Reputation: 5052

Since the input is always going to follow the same format, you could just do:

url.split("href=\"").last.split("\"").first
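If the markup might not always contain an href, a regular expression with a capture group is a slightly safer variant of the same string-based idea. This is a sketch using only the standard library; the html and no_link strings are sample inputs, and the URI check at the end is an optional addition rather than part of the original approach.

```ruby
require 'uri'

# Sample input copied from the question's output.
html = '<a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l">' \
       '<em>Gallon</em> - Wikipedia, the free encyclopedia</a>'

# String#[] with a regexp and a capture index returns the captured group,
# or nil when the pattern does not match.
href = html[/href="([^"]*)"/, 1]
puts href

# Optional sanity check: parse the result as a URI and look at its host.
puts URI.parse(href).host

# With no href present, the regexp yields nil, whereas the split chain
# silently returns the whole input string.
no_link = '<a>no link</a>'
puts no_link[/href="([^"]*)"/, 1].inspect
puts no_link.split("href=\"").last.split("\"").first
```

Returning nil on a miss makes the failure easy to detect, which the split chain does not.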

Upvotes: 0

Nikita Barsukov

Reputation: 2984

Watir is a reasonable choice for checking the layout of a web page.

require 'rubygems'
require 'watir'

# Launch a browser window and navigate to Google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")

# Print whether a link with href http://en.wikipedia.org/wiki/Gallon is present on the page
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?

Upvotes: 0

Matteo Alessani

Reputation: 10422

Try using:

site = doc.search("a[@href]")[16,1]

Upvotes: 0
